-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Artifact Manager #151
Artifact Manager #151
Conversation
CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅ |
I have read the CLA Document and I hereby sign the CLA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG in general. I would suggest to add timestamp as well. We can then add search by time functionality.
The default
collection is a bit odd. I assume that the typical usage will be informally assuming some semantic or schema consistency for each collection and organize artifact by that, then default because like whatever
. I think there is also a risk that people forget to specify the collection and incorrectly put artifacts in default collection, which can be hard to notice and hard to fix.
I think a nice extension is for artifact registry would be collections with explicit schema definition, say in json-schema
format. The benefits:
- add validation logic
- amenable to build analytic layers on top
- better search capabilities, e.g. with SQL
manager.log_object(obj2, artifact_name, collection) | ||
assert manager.get_artifact(artifact_name, collection) == obj2 | ||
assert manager.get_artifact(artifact_name, collection, 2) == obj1 | ||
assert manager.get_artifact(artifact_name, collection, 1) == obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
manager.get_artifact(artifact_name, collection, 4)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would just fail, right?
squirrel/artifact_manager/base.py
Outdated
raise NotImplementedError | ||
|
||
@abstractmethod | ||
def get_artifact_source( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would not implement artifact logic here and keep cohesion high, but let a user convert this manager to a catalog and let him/her retreive the source from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the source description of squirrel would be a nice vehicle to expose for pre-processing and filtering artifacts before fetching/downloading them based on meta-data?
squirrel/artifact_manager/base.py
Outdated
raise NotImplementedError | ||
|
||
@abstractmethod | ||
def log_object(self, obj: Any, name: str, collection: Optional[str] = None) -> Source: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how to retrieve objects again? can we restrict it to safe object types? e.g. safetensors, numpy, primitive types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to log_artifact
- can be retrieved via get_artifact
.
In general everything that can be serialised via the backend serialiser (currently messagepack, later deltalake) can be logged. What would be the motivation for restricting this?
The collection attribute is intended as an active collection to which objects are logged if no other target is provided. The assumption here is that this is set e.g. to the job id at the beginning. I renamed the attribute to make this more explicit.
Absolutely agree that this will be a useful extension, however, for me this is part of the semantic layer which exposes functionality for defining a schema and retrieving/filtering objects based on it.
This seems useful but will keep it out of the basic logger as it unnecessarily complicates the interface. |
This should not introduce any change to the interface. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, generally LGTM so far.
squirrel/artifact_manager/base.py
Outdated
|
||
@abstractmethod | ||
def log_artifact(self, obj: Any, name: str, collection: Optional[str] = None) -> Source: | ||
"""Log an arbitrary python object""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice if docstring has the serializer we use. Also, are we sure we need this? How are different python versions and pickle versions affecting this? I heard stories that people really dont like pickle for serializing due to version dependencies occurring. Having to figure out which versions were used during logging in able to load a year old artifact seems really painful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The serialiser depends on the backend - WandB has it's own, for SquirrelFileStores it's actually explicitly chosen when initialising the Manager.
This is PR is marked as stale as it has been inactive for 30 days. It will be closed in 7 days. |
@AlirezaSohofi @ThomasWollmann @adrianloy I updated the PR with your comments and also added a basic version of the WandB manager. This goes against what we discussed where each artifact is an individually versioned file or serialised python value. I now somewhat abused the WandB artifact to force each committed value into a separate WandB artifact. While I don't think it's a huge blocker right now, I'd be interested in your input on whether or not we should relax our assumption here and allow multiple files per artifact (not necessarily an issue) or stick with this. Also let me know if you any additional comments regarding the overall API. |
d1bc2a5
to
5af0566
Compare
I would follow their approach. I think wandb integration should be first citizen, as this is likely how we will use it the most. I thought the old artifact manager also coul dbe used for "log everything of this local folder as a single artifact" and I think in general that makes sense. So I think relaxing of our assumptions is fine. |
Co-authored-by: miha g <[email protected]>
2635599
to
ce7cb3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to go from my side as a first version, further implementation can go into follow-up PRs. Looking forward to use the new artifact manager woop woop
Description
This PR introduces a basic artifact manager functionality based on squirrel stores.
Fixes # issue
Type of change
Checklist: