-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: Remove tha artifact_name argument from ds.log_to_mlflow() #563
Conversation
…tifact name in the load_from_mlflow functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to rethink a few details before we potentially break the client API more than once.
dagshub/data_engine/datasources.py
Outdated
if artifact_name is None: | ||
raise ValueError("artifact_name must be specified") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the signature Optional
? Can you explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that making artifact_name
not optional means that it has to be put BEFORE the run
argument in the function signature. So instead of
def get_from_mlflow(run, artifact_name)
You'll have to make it
def get_from_mlflow(artifact_name, run)
Which will break the interface in a weirder way than having this ValueError check.
I could take Optional
out of the artifact_name type signature, but then assigning it to None immediately would look at the very least weird (and also mypy would scream at me)
Run to which the artifact was logged. | ||
""" | ||
|
||
now_time = datetime.datetime.now.strftime("%Y-%m-%dT%H-%M-%S") # Not ISO format to make it a valid filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems weird to me, log_to_mlflow
should be a member of QueryResult
IMO.
q = ds.all()
...
with mlflow.start_run() as run:
q.log_to_mlflow()
...
Maybe you could still be able to log a Datasource regardless, but it looks like a footgun for logging the wrong thing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a log_to_mlflow
function in the datasource before we thought about having it in the QueryResult.
I can take it out of datasource and put it into the query result, but then that's even more API breaking.
Your call
now_time = datetime.datetime.now.strftime("%Y-%m-%dT%H-%M-%S") # Not ISO format to make it a valid filename | ||
uuid_chunk = str(uuid.uuid4())[-4:] | ||
|
||
artifact_name = f"log_{self.source.name}_{now_time}_{uuid_chunk}.dagshub.dataset.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if as_of
is provided?
Or maybe you meant the two values to be independent?
Why do we need a timestamp in the file name if it's not tied to the datasource as_of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point honestly, I didn't think about what happens if there's already an as_of.
Add QueryResult.log_to_mlflow() Change the docs to not use the deprecated function
Make it load datasources with the as_of of the timestamp Make it load all datasource artifacts from the run if no artifact file specified
Changes after first review:
I also brought back the artifact_name arg in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks!
@@ -94,3 +101,24 @@ def _import_module(self): | |||
# Update this object's dict so that attribute references are efficient | |||
# (__getattr__ is only called on lookups that fail) | |||
self.__dict__.update(module.__dict__) | |||
|
|||
|
|||
def deprecated(additional_message=""): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
Co-authored-by: Simon L <[email protected]>
This PR makes it so the users can no longer supply the
artifact_name
argument to theds.log_to_mlflow()
function.