Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lightweight tags and tag namespaces #127

Open
aguschin opened this issue Apr 15, 2022 · 9 comments
Open

Lightweight tags and tag namespaces #127

aguschin opened this issue Apr 15, 2022 · 9 comments
Labels
design Design questions, that affects the product significantly question Further information is requested

Comments

@aguschin
Copy link
Contributor

Lightweight tags are created with simple git tag TAGNAME and doesn't contain information about creator, date, etc. GTO doesn't create them, but user can. They can register versions or promote something.

Now they're ignored by GTO, although they still can trigger CI and though they look pretty similar to annotated ones for unexperienced user. We need to support them I guess.

I would suggest reading and interpreting them by default, but supporting some flag --annotated to filter them out.
Downside of this is that with lightweight tags you don't know date of promotion, author, etc, so gto history and gto show myartifact will have some fields blank.

@aguschin aguschin added the question Further information is requested label Apr 15, 2022
@aguschin aguschin added this to the First GTO release milestone Apr 15, 2022
@omesser
Copy link
Contributor

omesser commented Apr 16, 2022

@aguschin - Can I ask why is this needed? Is there a use case? They lack mandatory information by definition - either limiting us to not have mandatory field, or, create inconsistency.

Also they usually are meant for local usage and not for collaboration, and it's a bad practice to push them to remotes - sounds like local inconsistencies is not the wanted property from a registry - usually all about consistent and shared source of truth to rely on in production ops.

Partially unrelated - Maybe we should even prefix gto's tags with gto: so as to avoid accidental mixup between gto's tags and other annotated tags?

@aguschin
Copy link
Contributor Author

aguschin commented Apr 16, 2022

@omesser thanks for your feedback!

@aguschin - Can I ask why is this needed? Is there a use case? They lack mandatory information by definition - either limiting us to not have mandatory field, or, create inconsistency.

No clear use case, I just think it can be confusing for users why the have some tag, but GTO doesn't take it into account. E.g. you see [email protected] LW tag, but don't see a v0.1.2 version for rf. Or you trigger CI with manually created LW tag rf#prod-5, successfully deploy your model with CI, and then find out that GTO don't know about promotion.

My take was that we should encourage users to create annotated tags and don't create LW ones, but still offer some way to support them. Do you argue that we don't need to support them at all? Or that by default we shouldn't support them and instead of --annotated have --lightweight that will take into account LW tags?

Partially unrelated - Maybe we should even prefix gto's tags with gto: so as to avoid accidental mixup between gto's tags and other annotated tags?

🤔 this will remove simplicity @dmpetrov wants to have, but we can make this optional I guess. On the other hand, may be it's ok to parse other tags also - e.g. if user have convention to register versions with name@version he'll also see his things in GTO.

As a side note, I saw some tools that use this structure, but tag conventions are different (related to #114). E.g.

  1. rushstack (it's a tool). Tag example: @rushstack/heft-jest-plugin_v0.2.14
  2. Lerna (it's a tool), example repo. Tag example: [email protected]. Another example repo: jupyterlab. Tag example: @jupyterlab/[email protected]
  3. Changesets (tool), example repo. Tag example: @emotion/[email protected]

We can't support all of these, I believe (maybe some cmd flags can help?). Right now we are compatible with 2, 3.

@omesser
Copy link
Contributor

omesser commented Apr 17, 2022

Thanks for the quick answer @aguschin !

LW tags

"No clear use case, I just think it can be confusing for users why the have some tag, but GTO doesn't take it into account. E.g. you see [email protected] LW tag, but don't see a v0.1.2 version for rf. Or you trigger CI with manually created LW tag rf#prod-5, successfully deploy your model with CI, and then find out that GTO don't know about promotion."

Why would users create tags manually and not by gto in the first place here? and how are they supposed to know about the restrictions and format of gto "managed" tags? It sounds like advanced usage to me personally - using gto would be easier, safer and simpler than creating tags manually for most users.

And of course, there's the other direction - What if the users want to create tags to trigger CI/CD or other reasons but don't want / don't foresee the effects in gto / the "registry"? How will they "opt out" if gto takes over all tags in the repo?

My biggest concern with LW tags is their tendency to be local. My local gto run would show me things that won't show on anyone else's checkout (or in studio) because they weren't pushed to upstream repo

About namespacing tags

"this will remove simplicity @dmpetrov wants to have"

I'm not sure picking up all tags makes things simpler or actually more complicated, it might make it worse for the team collaborating, if people start mixing formats and doing things manually by directly manipulating tags. Because a very real and likely possibility is that they also will manually remove tags to deregister (natural) and modify them instead of adhering to prod-N convention, and that's not how we want it to work in gto. it will make us lose history which is most of the value gto gives with those conventions over using git tags directly

Suggestions

So I would suggest a logical way to take this is:

  • Clearly namespace/differentiate gto tags - "be a good citizen with other systems / conventions" - something like @jupyterlab/ -> @gto/... or similar. This should be the default behavior imo. It does not complicate anything, but it does make the tags looks a bit less "clean". That is indeed a bit of a downside product-wise
  • Users with permissions can also manipulate tags - we can't control/prevent that anyways, so users have the ability to make intrusive registry-effecting operations if they opt-in (❗ ) - meaning use the above convention when tagging. we don't lose that
  • I guess there is no harm in allowing users to have a way to easily have gto discover and operate on existing tags, if they "coincidentally" happen to be compatible. This can be done by using something like --tag-prefix=X or even --tag-prefix="" to remove the prefix assumption/filter and make gto "take over" all tags. Suggest to keep this non default though.

Would love to get more opinions on this (CC @aguschin @dmpetrov @shcheklein @casperdcl @mike0sv ) - Is "clean looking" tags the priority here for a default behavior?
I personally tend to favor healthy behavior in advanced/production use cases / large teams when things get "hairy" over the "demo mode" scenarios, and don't think a gto prefix takes away from that significantly.
Missing proper namespace / over simplifying can cause frustration when users start integrating tools in their real environments and start hitting walls / edge-cases only after scaling their usage, when it's too late or expensive to modify modus operandi.

@aguschin
Copy link
Contributor Author

Two thoughts in continuation of this discussion

Namespaces

After some thinking, I would suggest what's came up in Product Sync call - using namespaces as types.
Using model and dataset namespace makes sense to me to easily (and without artifacts.yaml) identify artifact types, although it may have different sense in Lerna/Changesets (namespaces there are namespaces on NPM).

If we use model/[email protected] + dataset/[email protected] instead of [email protected] + [email protected], it will make this type approach thing more general.
I see two downsides:

  • possible conflict with artifacts.yaml
  • slightly more complex tag structure

Special names

On the other hand, I don't think hardcoding model and dataset as special names make sense in GTO to avoid messing with Studio BE. Studio should query GTO API to get all artifacts anyway, and then will filter them by "type". When filtering, BE can show special aliases like "model" together with type=model.

One downside of this is having different things shown in GTO and Studio. But I have troubles suggesting a case when this will break something when you have a single artifact of type "model" in repo. And when you have artifacts.yaml, it is not hard to add name "model" there and put type "model" to it, so everything works similarly in Studio and in CLI.

@jellebouwman
Copy link
Contributor

Thanks for this discussion @aguschin & @omesser - this thread came up during yesterday's Studio Front-end Sync when we were discussing the last steps we need to perform to get Models into the Studio View table. Studio might also benefit from what is being proposed here. I don't think Studio's implementation needs to prioritised, but it might be nice to consider it.

Studio context

As you can see in this screenshot, Screenshot 2022-04-19 at 09 47 03

(Link to Figma design)

for each commit row in the table, in the first cell we show a commit name. The commit name consists of any tags associated with the commit. In addition to this existing behaviour, we want to show the GTO model version inside the model's column cell. To be able to reliably filter out the GTO tag that describes the model version, having these namespaced tags would make this job easier. cc @Suor

@aguschin aguschin changed the title Add support for lightweight tags Lightweight tags and tag namespaces Apr 19, 2022
@aguschin
Copy link
Contributor Author

Also another thought: right now our convention for versions is [email protected], with v. Lerna/Changesets don't use v, and have tags like [email protected]. If we want to be compatible with them for some reason, we need to remove v. If we, otherwise, want to be different from them - we may keep that v as one differentiator.

@omesser
Copy link
Contributor

omesser commented Apr 20, 2022

Please see this discussion about v prefix. It's not officially part of the version/semver but it is a super common way to "tell" that the trailing string denotes a version.
And it is common practice in git tags: https://git-scm.com/book/en/v2/Git-Basics-Tagging

I suggest we keep it. It looks like we're still discussing the prefixes and namespaces - and it would be weird if we end up supporting 1.0.0 tag and not v1.0.0 tag. I also don't think it takes away value - [email protected] looks better than [email protected] to me, as I'm used to seeing v prefixes, and it clearly denotes the trailing number is a version, which it is

@casperdcl
Copy link

I prefer v too - some version-parsing tools expect it

@aguschin
Copy link
Contributor Author

For the record, how we may handle namespaces is now affected by current decision to support monorepos: tag models=mymodel references artifact mymodel annotated in models/dvc.yaml.

@aguschin aguschin added the design Design questions, that affects the product significantly label Aug 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Design questions, that affects the product significantly question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants