Artifact Manager #151

SebS94 · 2023-10-23T07:17:45Z

Description

This PR introduces a basic artifact manager functionality based on squirrel stores.

Fixes # issue

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring including code style reformatting
Other (please describe):

Checklist:

I have read the contributing guideline doc (external contributors only)
Lint and unit tests pass locally with my changes
I have kept the PR small so that it can be easily reviewed
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
All dependency changes have been reflected in the pip requirement files.

github-actions · 2023-10-23T07:18:00Z

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

SebS94 · 2023-10-23T14:00:03Z

I have read the CLA Document and I hereby sign the CLA

squirrel/artifact_manager/wandb.py

squirrel/artifact_manager/fs.py

AlirezaSohofi

LG in general. I would suggest adding a timestamp as well. We can then add search-by-time functionality.

The default collection is a bit odd. I assume that typical usage will informally assume some semantic or schema consistency within each collection and organize artifacts accordingly, which leaves the default collection as a catch-all. I think there is also a risk that people forget to specify the collection and incorrectly put artifacts in the default collection, which can be hard to notice and hard to fix.

I think a nice extension for the artifact registry would be collections with an explicit schema definition, say in JSON Schema format. The benefits:

add validation logic
amenable to build analytic layers on top
better search capabilities, e.g. with SQL
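As a rough illustration of the validation benefit, the sketch below checks artifact metadata against a per-collection schema. Everything here is hypothetical: `MODEL_SCHEMA`, `validation_errors` and the field names are not part of this PR, and the hand-rolled check only covers required fields and basic types as a stand-in for full JSON Schema validation (a library such as `jsonschema` would be the natural choice in practice).

```python
from typing import Any, Dict, List

# Hypothetical per-collection schema, written in a JSON-Schema-like shape.
MODEL_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["name", "accuracy"],
    "properties": {
        "name": {"type": "string"},
        "accuracy": {"type": "number"},
    },
}

# Maps the JSON Schema type names used above to Python types.
_PY_TYPES = {"string": str, "number": (int, float), "object": dict}


def validation_errors(metadata: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the metadata passes."""
    errors = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in metadata
    ]
    for key, spec in schema.get("properties", {}).items():
        if key in metadata and not isinstance(metadata[key], _PY_TYPES[spec["type"]]):
            errors.append(f"field {key} is not of type {spec['type']}")
    return errors


ok = validation_errors({"name": "resnet", "accuracy": 0.93}, MODEL_SCHEMA)
bad = validation_errors({"name": 42}, MODEL_SCHEMA)
```

Here `ok` is empty, while `bad` reports both the missing `accuracy` field and the wrongly typed `name`.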

squirrel/artifact_manager/base.py

squirrel/artifact_manager/fs.py

AlirezaSohofi · 2023-11-02T17:24:29Z

test/test_artifact_manager/test_logging.py

+    manager.log_object(obj2, artifact_name, collection)
+    assert manager.get_artifact(artifact_name, collection) == obj2
+    assert manager.get_artifact(artifact_name, collection, 2) == obj1
+    assert manager.get_artifact(artifact_name, collection, 1) == obj


manager.get_artifact(artifact_name, collection, 4)

This would just fail, right?
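To make the question concrete, here is a minimal in-memory sketch of the versioning behaviour the test exercises. `InMemoryArtifactManager` is a hypothetical stand-in, not the squirrel implementation. It assumes versions are numbered from 1, that omitting the version returns the latest one, and that an unknown version raises a `KeyError`; how the real backends react to a missing version is not stated in this thread.

```python
from typing import Any, Dict, List, Optional, Tuple


class InMemoryArtifactManager:
    """Hypothetical stand-in used only to illustrate version lookup."""

    def __init__(self) -> None:
        # (collection, name) -> list of logged objects; index 0 holds version 1
        self._store: Dict[Tuple[str, str], List[Any]] = {}

    def log_object(self, obj: Any, name: str, collection: str) -> int:
        """Append a new version and return its 1-based version number."""
        versions = self._store.setdefault((collection, name), [])
        versions.append(obj)
        return len(versions)

    def get_artifact(self, name: str, collection: str, version: Optional[int] = None) -> Any:
        """Return the requested version, or the latest one if none is given."""
        versions = self._store.get((collection, name), [])
        if version is None:
            version = len(versions)
        if not 1 <= version <= len(versions):
            raise KeyError(f"{collection}/{name} has no version {version}")
        return versions[version - 1]


manager = InMemoryArtifactManager()
for value in ("obj", "obj1", "obj2"):
    manager.log_object(value, "model", "run-1")

latest = manager.get_artifact("model", "run-1")
second = manager.get_artifact("model", "run-1", 2)
try:
    manager.get_artifact("model", "run-1", 4)
    missing_raises = False
except KeyError:
    missing_raises = True
```

Under these assumptions, asking for version 4 after three logged versions fails loudly rather than silently returning the latest object.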

squirrel/artifact_manager/base.py

ThomasWollmann · 2023-11-05T10:23:58Z

squirrel/artifact_manager/base.py

+        raise NotImplementedError
+
+    @abstractmethod
+    def get_artifact_source(


I would not implement artifact logic here, to keep cohesion high, but instead let the user convert this manager to a catalog and retrieve the source from there.

I thought squirrel's source description would be a nice vehicle for pre-processing and filtering artifacts based on metadata before fetching or downloading them?
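The idea of filtering on metadata before fetching can be sketched as follows. This does not use squirrel's actual `Source` or `Catalog` classes; `SourceEntry`, `make_entry` and the `stage` metadata key are made up for illustration, and the `fetched` list exists only to show which payloads were actually loaded.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class SourceEntry(NamedTuple):
    """Hypothetical lightweight description of where an artifact lives."""

    name: str
    metadata: Dict[str, Any]
    load: Callable[[], Any]  # invoked only when the payload is needed


fetched: List[str] = []  # records which payloads were actually loaded


def make_entry(name: str, metadata: Dict[str, Any], payload: Any) -> SourceEntry:
    """Build an entry whose payload is only produced when load() is called."""

    def load() -> Any:
        fetched.append(name)
        return payload

    return SourceEntry(name, metadata, load)


catalog = [
    make_entry("model-a", {"stage": "prod"}, b"weights-a"),
    make_entry("model-b", {"stage": "dev"}, b"weights-b"),
]

# Filter on metadata first, then fetch only what survived the filter.
selected = [entry for entry in catalog if entry.metadata["stage"] == "prod"]
payloads = [entry.load() for entry in selected]
```

Only `model-a` is loaded; `model-b` is excluded by its metadata and never fetched.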

squirrel/artifact_manager/base.py

ThomasWollmann · 2023-11-05T10:34:08Z

squirrel/artifact_manager/base.py

+        raise NotImplementedError
+
+    @abstractmethod
+    def log_object(self, obj: Any, name: str, collection: Optional[str] = None) -> Source:


How do we retrieve objects again? Can we restrict it to safe object types, e.g. safetensors, numpy, primitive types?

Renamed to log_artifact - can be retrieved via get_artifact.
In general everything that can be serialised via the backend serialiser (currently messagepack, later deltalake) can be logged. What would be the motivation for restricting this?
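For reference, a restriction to safe types could look roughly like the allow-list check below. This is not part of the PR; `is_safe_to_log` and the chosen allow-list are assumptions, and numpy or safetensors support is left out to keep the sketch standard-library only.

```python
from typing import Any

# Assumed allow-list: primitives plus containers built from them.
_SAFE_SCALARS = (type(None), bool, int, float, str, bytes)


def is_safe_to_log(obj: Any) -> bool:
    """Return True if obj consists only of allow-listed types."""
    if isinstance(obj, _SAFE_SCALARS):
        return True
    if isinstance(obj, (list, tuple)):
        return all(is_safe_to_log(item) for item in obj)
    if isinstance(obj, dict):
        return all(isinstance(k, str) and is_safe_to_log(v) for k, v in obj.items())
    return False


safe = is_safe_to_log({"lr": 0.01, "layers": [64, 64], "tag": None})
unsafe = is_safe_to_log({"callback": print})
```

A manager could call such a check at the top of `log_artifact` and reject anything that does not pass, at the cost of the flexibility described above.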

SebS94 · 2023-11-07T14:22:04Z

The default collection is a bit odd. I assume that typical usage will informally assume some semantic or schema consistency within each collection and organize artifacts accordingly, which leaves the default collection as a catch-all. I think there is also a risk that people forget to specify the collection and incorrectly put artifacts in the default collection, which can be hard to notice and hard to fix.

The collection attribute is intended as an active collection to which objects are logged if no other target is provided. The assumption here is that this is set e.g. to the job id at the beginning. I renamed the attribute to make this more explicit.
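A minimal sketch of this active-collection behaviour, assuming an attribute called `active_collection` (the thread does not state the new attribute name) and a hypothetical `CollectionAwareManager` class:

```python
from typing import Any, Dict, Optional, Tuple


class CollectionAwareManager:
    """Hypothetical manager illustrating an active-collection fallback."""

    def __init__(self, active_collection: str = "default") -> None:
        self.active_collection = active_collection
        self._store: Dict[Tuple[str, str], Any] = {}

    def log_artifact(self, obj: Any, name: str, collection: Optional[str] = None) -> str:
        """Store obj and return the collection it ended up in."""
        target = collection if collection is not None else self.active_collection
        self._store[(target, name)] = obj
        return target


manager = CollectionAwareManager()
manager.active_collection = "job-1234"  # e.g. set once at job start
implicit = manager.log_artifact({"loss": 0.1}, "metrics")
explicit = manager.log_artifact({"loss": 0.2}, "metrics", collection="baseline")
```

Artifacts logged without a target land in `job-1234`, while an explicit `collection` argument still takes precedence.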

I think a nice extension for the artifact registry would be collections with an explicit schema definition, say in JSON Schema format.

Absolutely agree that this will be a useful extension, however, for me this is part of the semantic layer which exposes functionality for defining a schema and retrieving/filtering objects based on it.

I would suggest adding a timestamp as well. We can then add search-by-time functionality.

This seems useful, but I will keep it out of the basic logger as it unnecessarily complicates the interface.
What I think makes sense here is to provide a minimal schema containing basic attributes (e.g. author, timestamp) as part of the semantic layer.

AlirezaSohofi · 2023-11-13T13:44:46Z

This seems useful, but I will keep it out of the basic logger as it unnecessarily complicates the interface.
What I think makes sense here is to provide a minimal schema containing basic attributes (e.g. author, timestamp) as part of the semantic layer.

This should not introduce any change to the interface. The minimal schema you mentioned was actually aligned with what I had in mind. Whether it's part of artifact itself or semantic layer is up for discussion and involves trade-offs.
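One way to picture the minimal schema under discussion is a record that carries an author and a timestamp next to the payload, which in turn makes search by time straightforward. `ArtifactRecord` and `logged_between` are hypothetical names, and whether such metadata lives in the artifact itself or in a semantic layer is exactly the open question in this thread.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List


@dataclass
class ArtifactRecord:
    """Hypothetical record pairing a payload with minimal metadata."""

    name: str
    payload: Any
    author: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def logged_between(records: List[ArtifactRecord], start: datetime, end: datetime) -> List[ArtifactRecord]:
    """Return records whose timestamp lies in the half-open interval [start, end)."""
    return [r for r in records if start <= r.timestamp < end]


records = [
    ArtifactRecord("model", b"v1", "alice", datetime(2023, 11, 1, tzinfo=timezone.utc)),
    ArtifactRecord("model", b"v2", "bob", datetime(2023, 11, 10, tzinfo=timezone.utc)),
]
recent = logged_between(
    records,
    datetime(2023, 11, 5, tzinfo=timezone.utc),
    datetime(2023, 12, 1, tzinfo=timezone.utc),
)
```

Because the timestamp defaults to the current UTC time, callers of a logging method would not need to pass anything extra, which matches the point that the interface itself need not change.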

adrianloy

Some comments, generally LGTM so far.

squirrel/artifact_manager/base.py

adrianloy · 2023-11-13T15:23:38Z

squirrel/artifact_manager/base.py

+
+    @abstractmethod
+    def log_artifact(self, obj: Any, name: str, collection: Optional[str] = None) -> Source:
+        """Log an arbitrary python object"""


It would be nice if the docstring named the serializer we use. Also, are we sure we need this? How do different Python versions and pickle versions affect this? I have heard stories that people really don't like pickle for serializing because of the version dependencies it introduces. Having to figure out which versions were used during logging in order to load a year-old artifact seems really painful.

The serialiser depends on the backend - WandB has its own; for SquirrelFileStores it is explicitly chosen when initialising the Manager.
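The sketch below shows what choosing the serialiser at initialisation might look like. It is not the squirrel API: `SerializingManager` and the `SERIALIZERS` registry are hypothetical, and JSON stands in for the MessagePack serialiser mentioned earlier so that the example runs with the standard library alone.

```python
import json
import pickle
from typing import Any, Callable, Dict, Tuple

# name -> (dumps, loads); both entries use only the standard library
SERIALIZERS: Dict[str, Tuple[Callable[[Any], bytes], Callable[[bytes], Any]]] = {
    "json": (
        lambda obj: json.dumps(obj).encode("utf-8"),
        lambda raw: json.loads(raw.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}


class SerializingManager:
    """Hypothetical manager whose serializer is fixed at construction time."""

    def __init__(self, serializer: str = "json") -> None:
        self.serializer = serializer
        self._dumps, self._loads = SERIALIZERS[serializer]
        self._blobs: Dict[str, bytes] = {}

    def log_artifact(self, obj: Any, name: str) -> None:
        """Serialize obj with the configured serializer and store the bytes."""
        self._blobs[name] = self._dumps(obj)

    def get_artifact(self, name: str) -> Any:
        """Load and deserialize a previously logged artifact."""
        return self._loads(self._blobs[name])


manager = SerializingManager(serializer="json")
manager.log_artifact({"epochs": 3, "tags": ["a", "b"]}, "config")
restored = manager.get_artifact("config")
```

Keeping the serialiser name on the manager makes it easy to mention in docstrings or to store alongside the artifact, which would help with the question of how an old artifact was written.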

squirrel/artifact_manager/base.py

squirrel/artifact_manager/fs.py

github-actions · 2023-12-13T18:27:01Z

This PR is marked as stale as it has been inactive for 30 days. It will be closed in 7 days.

SebS94 · 2023-12-15T17:05:45Z

@AlirezaSohofi @ThomasWollmann @adrianloy I updated the PR with your comments and also added a basic version of the WandB manager.
Unfortunately, the WandB artifact API is actually somewhat incompatible with what we envisioned. They manage versioned artifacts within collections ("artifact types" in their terminology). Each artifact is actually not atomic but a folder of its own that can contain arbitrary files.

This goes against what we discussed where each artifact is an individually versioned file or serialised python value. I now somewhat abused the WandB artifact to force each committed value into a separate WandB artifact. While I don't think it's a huge blocker right now, I'd be interested in your input on whether or not we should relax our assumption here and allow multiple files per artifact (not necessarily an issue) or stick with this.

Also let me know if you have any additional comments regarding the overall API.

adrianloy · 2023-12-18T13:23:20Z

I would follow their approach. I think the wandb integration should be a first-class citizen, as this is likely how we will use it the most. I thought the old artifact manager could also be used for "log everything in this local folder as a single artifact", and I think in general that makes sense. So I think relaxing our assumptions is fine.
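The "log a whole folder as a single artifact" idea can be sketched on a plain filesystem as below. This is an illustration only: the `log_folder` helper here, its signature and the `store/name/<version>` layout are assumptions and do not describe the implementation in this PR or the WandB backend.

```python
import shutil
import tempfile
from pathlib import Path


def log_folder(src: Path, store: Path, name: str) -> Path:
    """Copy src into store/name/<version> and return the new version directory."""
    artifact_root = store / name
    artifact_root.mkdir(parents=True, exist_ok=True)
    # Next version is one more than the number of existing version directories.
    version = sum(1 for p in artifact_root.iterdir() if p.is_dir()) + 1
    target = artifact_root / str(version)
    shutil.copytree(src, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "outputs"
    src.mkdir()
    (src / "weights.bin").write_bytes(b"\x00\x01")
    (src / "config.json").write_text("{}")

    store = Path(tmp) / "store"
    first = log_folder(src, store, "training-run")
    second = log_folder(src, store, "training-run")
    first_version = first.name
    second_version = second.name
    logged_files = sorted(p.name for p in second.iterdir())
```

Each call produces a new version directory that holds every file from the source folder, so a single artifact version can contain multiple files.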

mg515

Good to go from my side as a first version; further implementation can go into follow-up PRs. Looking forward to using the new artifact manager woop woop

github-actions bot added a commit that referenced this pull request Oct 23, 2023

@SebS94 has signed the CLA from Pull Request #151

d98af27

ThomasWollmann reviewed Oct 30, 2023

squirrel/artifact_manager/wandb.py (resolved)

ThomasWollmann reviewed Oct 30, 2023

squirrel/artifact_manager/fs.py (outdated, resolved)

AlirezaSohofi reviewed Nov 2, 2023

ThomasWollmann reviewed Nov 5, 2023

squirrel/artifact_manager/base.py (outdated, resolved)

ThomasWollmann reviewed Nov 5, 2023

squirrel/artifact_manager/base.py (outdated, resolved)

ThomasWollmann reviewed Nov 5, 2023

squirrel/artifact_manager/base.py (outdated, resolved)

ThomasWollmann reviewed Nov 5, 2023

squirrel/artifact_manager/base.py (outdated, resolved)

ThomasWollmann reviewed Nov 5, 2023

squirrel/artifact_manager/base.py (resolved)

ThomasWollmann reviewed Nov 5, 2023

SebS94 requested review from ThomasWollmann, AlirezaSohofi, adrianloy, martin-genzel and flix59 November 9, 2023 08:04

adrianloy reviewed Nov 13, 2023

github-actions bot added the no-pr-activity label Dec 13, 2023

SebS94 force-pushed the seb-artifact-manager branch from d1bc2a5 to 5af0566 Compare December 15, 2023 17:08

github-actions bot removed the no-pr-activity label Dec 15, 2023

SebS94 requested a review from adrianloy December 18, 2023 12:08

adrianloy marked this pull request as ready for review December 18, 2023 13:20

ThomasWollmann removed their request for review December 18, 2023 13:28

SebS94 and others added 16 commits December 22, 2023 10:15

Integrating first round of feedback

51fe46a

Refactor logging of multiple files

18d074f

Renaming collection parameter

718ccc3

Extended tests

a3f9e72

First wandb version

3fc5646

API update

57cd175

Fixing filesystem tests

a4d91fe

Adding wandb tests

497156c

Fixing log_folder

fe811e2

Additional tests

53c7b8a

Updating wandb test-suite

ce3de84

Linting

e4695b5

Small comment

308df5d

Update squirrel/artifact_manager/fs.py

7fbfc6e

Co-authored-by: miha g <[email protected]>

Rebasing on updated requirements

1124fd4

Revisiting local interactions

ce7cb3a

SebS94 force-pushed the seb-artifact-manager branch from 2635599 to ce7cb3a Compare December 22, 2023 13:36

SebS94 and others added 4 commits December 22, 2023 15:16

Simplified download

90b1cd4

Cleanup

705522c

collection constructor arg

d5f8bbc

collection optional, styling

3c5c3d6

mg515 self-requested a review December 22, 2023 15:18

mg515 previously approved these changes Dec 22, 2023

Bump version and minor clean up

73f3754

SebS94 dismissed mg515’s stale review via 73f3754 December 22, 2023 22:23

SebS94 requested a review from mg515 December 22, 2023 22:25

mg515 approved these changes Dec 27, 2023

SebS94 merged commit 0ed5733 into main Jan 2, 2024
4 checks passed

SebS94 deleted the seb-artifact-manager branch January 2, 2024 08:55

github-actions bot locked and limited conversation to collaborators Jan 2, 2024
