
Fix: Remove the artifact_name argument from ds.log_to_mlflow() #563

Merged 6 commits into master on Dec 17, 2024

Conversation

kbolashev (Member)

This PR removes users' ability to supply the artifact_name argument to the ds.log_to_mlflow() function.

@kbolashev added the enhancement and UX labels on Dec 11, 2024
@kbolashev requested a review from simonlsk on Dec 11, 2024, 13:47
@kbolashev self-assigned this on Dec 11, 2024

dagshub bot commented Dec 11, 2024

@simonlsk (Contributor) left a comment


I think we need to rethink a few details before we potentially break the client API more than once.

dagshub/data_engine/datasources.py (resolved conversation)
Comment on lines 193 to 194
if artifact_name is None:
raise ValueError("artifact_name must be specified")
simonlsk (Contributor):

Isn't the signature Optional ? Can you explain?

kbolashev (Member, Author) commented Dec 15, 2024:

The problem is that making artifact_name not optional means it has to be put BEFORE the run argument in the function signature. So instead of

def get_from_mlflow(run, artifact_name)

you'd have to make it

def get_from_mlflow(artifact_name, run)

which would break the interface in a weirder way than having this ValueError check.
I could take Optional out of the artifact_name type signature, but then assigning it to None immediately would look weird at the very least (and mypy would also scream at me).
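
A minimal sketch of the approach described above, keeping run first and validating at runtime (the exact signature here is an assumption based on the discussion, not copied from the PR):

from typing import Optional

def get_from_mlflow(run=None, artifact_name: Optional[str] = None):
    # run stays the first (optional) argument; artifact_name stays Optional in the
    # type signature, and being "required" is enforced at runtime rather than by
    # argument order.
    if artifact_name is None:
        raise ValueError("artifact_name must be specified")
    ...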

Run to which the artifact was logged.
"""

now_time = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")  # Not ISO format to make it a valid filename
simonlsk (Contributor):

This seems weird to me, log_to_mlflow should be a member of QueryResult IMO.

q = ds.all()
...
with mlflow.start_run() as run:
    q.log_to_mlflow()
    ...

Maybe you could still allow logging a Datasource regardless, but it looks like a footgun for logging the wrong thing to me.

kbolashev (Member, Author):

I had a log_to_mlflow function in the Datasource before we thought about having it in the QueryResult.
I can take it out of the Datasource and put it into the QueryResult, but then that's even more API-breaking.
Your call.

now_time = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")  # Not ISO format to make it a valid filename
uuid_chunk = str(uuid.uuid4())[-4:]

artifact_name = f"log_{self.source.name}_{now_time}_{uuid_chunk}.dagshub.dataset.json"
simonlsk (Contributor):

What if as_of is provided?
Or maybe you meant the two values to be independent?
Why do we need a timestamp in the file name if it's not tied to the datasource as_of?

kbolashev (Member, Author):

Fair point honestly, I didn't think about what happens if there's already an as_of.

@kbolashev requested a review from simonlsk on Dec 15, 2024, 08:51
Add QueryResult.log_to_mlflow()

Change the docs to not use the deprecated function
Make it load datasources with the as_of of the timestamp
Make it load all datasource artifacts from the run if no artifact file specified
@kbolashev (Member, Author) commented Dec 15, 2024

Changes after first review:

  • Datasource.log_to_mlflow is marked as deprecated and points users to QueryResult.log_to_mlflow
  • Added a QueryResult.log_to_mlflow function that logs the QueryResult to MLflow with a log_ prefix
  • Changed the docs of the related functions to refer to the QueryResult function instead of the Datasource one
  • Changes to datasources.get_from_mlflow:
    • It now returns a dictionary of {artifact_path: Datasource}
    • If artifact_name is specified, only that artifact is loaded, but a dictionary is still returned
    • If artifact_name is not specified, all artifacts in MLflow are listed, and any artifact ending with .dagshub.dataset.json is loaded as another datasource
    • The as_of of every loaded Datasource is set to the timestamp in the artifact, even if no as_of was specified in the original query.

I also brought back the artifact_name arg in Datasource.log_to_mlflow(), since IMO a deprecation warning is enough to scare users. I also check whether the name ends with .dagshub.dataset.json and append it if it doesn't, so even if they mess up their artifact name, it should hopefully still be parsed by the frontend.
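
A hedged usage sketch of the flow described in this comment (module paths, argument names, and the run identifier type are assumptions based on the discussion, not verified against the merged code):

import mlflow
from dagshub.data_engine import datasources

ds = datasources.get_datasource("<user>/<repo>", "<datasource name>")

with mlflow.start_run() as run:
    q = ds.all()
    # Logs the QueryResult as a log_<source>_<timestamp>_<uuid>.dagshub.dataset.json artifact
    q.log_to_mlflow()

# Later: load the datasource(s) back from the run.
# Per this PR, get_from_mlflow returns {artifact_path: Datasource}; passing
# artifact_name loads only that artifact, still returned as a dictionary.
loaded = datasources.get_from_mlflow(run=run.info.run_id)
for artifact_path, loaded_ds in loaded.items():
    print(artifact_path, loaded_ds)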

@simonlsk (Contributor) left a comment

LGTM!
Thanks!

@@ -94,3 +101,24 @@ def _import_module(self):
# Update this object's dict so that attribute references are efficient
# (__getattr__ is only called on lookups that fail)
self.__dict__.update(module.__dict__)


def deprecated(additional_message=""):
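
For context, a minimal sketch of what a decorator with this signature typically looks like (an illustrative assumption, not the PR's actual implementation):

import functools
import warnings

def deprecated(additional_message=""):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Emit a DeprecationWarning pointing at the caller, then delegate.
            warnings.warn(
                f"{func.__name__} is deprecated. {additional_message}",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator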
simonlsk (Contributor):

Cool!

dagshub/data_engine/model/datasource.py (outdated, resolved conversation)
@kbolashev merged commit 5184452 into master on Dec 17, 2024
7 checks passed
Labels: enhancement, UX
2 participants