add version specifiers to schemas (and potentially cabs and recipes) #306

Open
o-smirnov opened this issue Jun 1, 2024 · 7 comments

@o-smirnov
Member

Related to discussions with @JSKenyon... how do we deal with evolving CLIs? Multiple versions of, say, wsclean are already supported via image: version, but there's no way to tell Stimela that a particular parameter is only available with e.g. version 3.1 and up (or, conversely, has been deprecated).

Proposal:

  • Cabs shalt have an optional version attribute, populated from cab: image: version if not set.

  • Parameter schemas shalt have an optional versions attribute, specified PyPI style, e.g. versions: >=3.1.

  • Inputs/outputs shalt be (de)activated by comparing their version string to the cab version, if both are specified. There are standard libraries for version parsing.

  • If we really want to be user friendly, we don't just delete a deactivated parameter from the schema, we leave a stub entry so that stimela can tell the user they've specified a parameter from a wrong version of the cab.
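
A minimal sketch of the activation rule proposed above, assuming a parameter schema is a plain dict with an optional `versions` key and that version strings are dotted integers. The names `satisfies` and `activate_params`, the stub layout, and the example parameter names are hypothetical, not stimela API. The hand-rolled matcher only covers simple comparisons; a real implementation would more likely rely on the `packaging` library, whose `SpecifierSet` handles the full PyPI-style grammar.

```python
import operator
import re

# Comparison operators supported by this sketch (a subset of PyPI-style specifiers).
_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        "!=": operator.ne, ">": operator.gt, "<": operator.lt}
_CLAUSE = re.compile(r"\s*(>=|<=|==|!=|>|<)\s*(\d+(?:\.\d+)*)\s*")


def _key(version, width=4):
    """Turn '3.1' into (3, 1, 0, 0) so that '3.1' and '3.1.0' compare equal."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version, spec):
    """True if a dotted version satisfies every comma-separated clause of spec."""
    for clause in spec.split(","):
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"unsupported version specifier: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](_key(version), _key(bound)):
            return False
    return True


def activate_params(schema, cab_version):
    """Split a schema into active parameters and stubs for deactivated ones."""
    active, stubs = {}, {}
    for name, param in schema.items():
        spec = param.get("versions")
        # Compare only if both the cab version and the parameter spec are set.
        if spec is None or cab_version is None or satisfies(cab_version, spec):
            active[name] = param
        else:
            # Keep a stub so a later error can mention the version mismatch.
            stubs[name] = {"versions": spec}
    return active, stubs


# Hypothetical schema: the parameter names are made up for illustration.
schema = {
    "size": {"dtype": "int"},
    "new-option": {"dtype": "bool", "versions": ">=3.1"},
    "old-option": {"dtype": "str", "versions": "<3.0"},
}
active, stubs = activate_params(schema, "3.2")
print(sorted(active))  # ['new-option', 'size']
print(stubs)           # {'old-option': {'versions': '<3.0'}}
```

If the cab has no version at all, nothing is deactivated, which matches the "if both are specified" condition in the proposal.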

Possible more advanced features:

  • Support version specification in the step itself, e.g. cab: wsclean>=3.1. In the first instance, this at least allows the recipe to error out in prevalidation if the wrong version of the cab is defined.

  • This opens the door to having multiply versioned cab definitions in e.g. cult-cargo, with stimela being able to resolve which one to use if the recipe specifies a particular dependency. Those would have to live under a separately structured versioned_cabs (or something like that) top-level section, lest we break existing recipes which use cabs.

  • This is easily extended to supporting and checking optional recipe versions, e.g. recipe: tron>=0.1.
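
The step-level idea above could be sketched roughly as follows. This assumes a requirement is a cab name followed by at most one comparison, and that the `versioned_cabs` section maps each cab name to a dict keyed by version string. `parse_cab_requirement` and `resolve_cab` are hypothetical helpers, and the listed wsclean versions are placeholders.

```python
import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}
# A cab name optionally followed by one comparison, e.g. "wsclean>=3.1".
_REQ = re.compile(
    r"\s*([A-Za-z_][\w.-]*?)\s*(?:(>=|<=|==|>|<)\s*(\d+(?:\.\d+)*))?\s*")


def _key(version):
    """'3.1' -> (3, 1, 0, 0), so versions of different lengths compare sanely."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (4 - len(parts)))


def parse_cab_requirement(text):
    """Split 'wsclean>=3.1' into ('wsclean', '>=', '3.1').

    A bare name such as 'wsclean' gives ('wsclean', None, None).
    """
    match = _REQ.fullmatch(text)
    if match is None:
        raise ValueError(f"cannot parse cab requirement: {text!r}")
    return match.groups()


def resolve_cab(text, versioned_cabs):
    """Pick the newest available version of a cab that meets the requirement."""
    name, op, bound = parse_cab_requirement(text)
    if name not in versioned_cabs:
        raise KeyError(f"unknown cab: {name}")
    candidates = [v for v in versioned_cabs[name]
                  if op is None or _OPS[op](_key(v), _key(bound))]
    if not candidates:
        # This is the point where prevalidation would error out.
        raise LookupError(f"no version of {name} satisfies {op}{bound}")
    return name, max(candidates, key=_key)


# Placeholder data: the cab definitions themselves are elided as empty dicts.
versioned_cabs = {"wsclean": {"2.10": {}, "3.1": {}, "3.3": {}}}
print(resolve_cab("wsclean>=3.1", versioned_cabs))  # ('wsclean', '3.3')
print(resolve_cab("wsclean<3.0", versioned_cabs))   # ('wsclean', '2.10')
```

The same parser would apply unchanged to a recipe requirement such as `tron>=0.1`.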

Thoughts @sjperkins @SpheMakh @landmanbester?

@o-smirnov o-smirnov added the enhancement New feature or request label Jun 1, 2024
@o-smirnov o-smirnov added this to the R2.1 milestone Jun 1, 2024
@o-smirnov o-smirnov self-assigned this Jun 1, 2024
@sjperkins
Collaborator

I'll take a closer look at this during the week but, for the moment, it's probably worth mentioning that Schema Evolution is an entire topic in its own right: there's probably a good deal of prior knowledge that can be drawn on.

One example that springs to mind is Google Protocol Buffers, which define message schemas for use by Remote Procedure Calls (gRPC). They can evolve over time and, for example, "Modifying gRPC services over time" suggests some best practices in this context.

@JSKenyon
Collaborator

JSKenyon commented Jun 3, 2024

I am obviously on board for this but I think we could be even more ambitious/user-friendly. I think I have mentioned my reservations about coupling image versions to the cult-cargo version. Specifically, this may eventually make installation and reproducibility very difficult. I think that @o-smirnov's points above are part of the solution but I think that the real change needs to happen in cult-cargo.

Specifically, I think that we should consider turning cult-cargo into a queryable database of images and schemas, i.e. all schemas and images persist in this database. This would make it possible to completely decouple image versioning from package versioning, and it simultaneously (partially) solves the schema problem as every image could be associated with a schema. This also means that there is no need to expose the schema parameter to the average user, which will spare us some headaches. Obviously, this interface should support pulling in schemas at runtime, i.e. as part of validation.

  • If we really want to be user friendly, we don't just delete a deactivated parameter from the schema, we leave a stub entry so that stimela can tell the user they've specified a parameter from a wrong version of the cab.

I think that it is completely acceptable to bail out if a user attempts to use a parameter which is not part of the associated schema, and being too clever on this point may lead to future pain. We can easily minimize schema duplication as having multiple images share a schema would not be difficult. As an aside, this also means that new images could be added without requiring a package release of cult-cargo which in turn alleviates developer burden and makes using dev/branch images much simpler.
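
To make the database idea a little more concrete, here is a toy in-memory stand-in, assuming each image version points at a named schema so that several versions can share one. `SchemaRegistry`, the schema id, and the `input-ms` parameter are hypothetical, and nothing here reflects how cult-cargo is actually organised.

```python
class SchemaRegistry:
    """Toy in-memory registry: each (image, version) points at a named schema."""

    def __init__(self):
        self._schemas = {}   # schema id -> parameter schema
        self._images = {}    # (image name, version) -> schema id

    def add_schema(self, schema_id, params):
        self._schemas[schema_id] = params

    def add_image(self, image, version, schema_id):
        if schema_id not in self._schemas:
            raise KeyError(f"unknown schema: {schema_id}")
        self._images[(image, version)] = schema_id

    def schema_for(self, image, version):
        """Look up the schema at validation time; raises KeyError if unregistered."""
        return self._schemas[self._images[(image, version)]]


registry = SchemaRegistry()
registry.add_schema("quartical-v1", {"input-ms": {"dtype": "str"}})
# Two image versions share one schema, so nothing is duplicated.
registry.add_image("quartical", "1.0.0", "quartical-v1")
registry.add_image("quartical", "1.0.1", "quartical-v1")
shared = registry.schema_for("quartical", "1.0.0")
print(shared is registry.schema_for("quartical", "1.0.1"))  # True
```

Adding a new image version is then a registry update only, with no package release involved.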

In order to fully support the above, we would need to make versioning in stimela recipes less opaque i.e. at present there is an implication that using an image is the latest version in cult-cargo (when using cult-cargo images). This is great for user-friendliness but my personal opinion is that this does not serve the end goal of reproducibility. To address this, we could consider a stimela publish command that does the following:

  • Explicitly adds version fields into all cabs recognized by cult-cargo.
  • Creates a flat recipe.yaml and a flat cabs.yaml.
  • pip freeze any virtual environments into a venv directory.
    • Adds info fields to any step requiring externally defined Python (breifast case) to make it very clear that these are not images/where that code should live.
  • ... (I reserve the right to add things as they occur to me)
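
As a rough illustration of the first bullet, the version-pinning step might look like the sketch below. It assumes cab definitions are nested dicts in which `image` is a dict with an optional `version` key, and that unversioned images default to the cult-cargo release version. `pin_cab_versions` and the example cabs are hypothetical.

```python
import copy


def pin_cab_versions(cabs, release_version):
    """Return a copy of the cab definitions with every image version explicit."""
    pinned = copy.deepcopy(cabs)  # leave the caller's definitions untouched
    for cab in pinned.values():
        image = cab.get("image")
        if image is None:
            continue  # not image-backed, e.g. externally defined Python
        # Keep an explicit version if present, otherwise freeze the default.
        image.setdefault("version", release_version)
    return pinned


# Hypothetical cab definitions for illustration only.
cabs = {
    "wsclean": {"image": {"name": "wsclean"}},
    "quartical": {"image": {"name": "quartical", "version": "1.0.0"}},
    "my-script": {"command": "myscript.py"},
}
pinned = pin_cab_versions(cabs, "0.1.3")
print(pinned["wsclean"]["image"])    # {'name': 'wsclean', 'version': '0.1.3'}
print(pinned["quartical"]["image"])  # {'name': 'quartical', 'version': '1.0.0'}
```

The pinned dict is what would then be written out as the flat cabs.yaml.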

There are some other things we could consider (if we decide to be very ambitious):

  • Include recipes in the database i.e. if there are recipes/sub-recipes which are reused regularly, we can version them and make them queryable in the same way as images. This would require recursive functionality, as the user would specify something like step: recipe: cookbook.cc/breifast:version, and that recipe would internally have explicitly defined image versions/schemas associated with it.
  • Partial support for github-based steps - not a fully formed idea as yet - e.g. cabs: name: package: https://github.com/ratt-ru/QuartiCal.git where this implies installing said package in the python-astro container. This would be powerful for testing/development.

I am going to stop here as this is getting muddled. I could also try and parse this into separate ideas and put them on cult-cargo if necessary.

@sjperkins
Collaborator

I'd imagine at some point that dependency resolution may become a necessity (similar to pip). There's an outdated Python package called mixology which appears to handle dependency resolution for the concept of a generic package (i.e. not specific to a Python package on PyPI). It's based on the pubgrub algorithm, which seems to be the current state of the art.

@o-smirnov
Member Author

I'm a bit reticent about adding top-heavy structures... the current scheme is simple, and relies on standard repositories (PyPI and quay.io) where all versions persist. It's also easy to replicate for somebody who wants to maintain their own cult-cargo-like collection. I also think all information required for full reproducibility is already in there, unless I'm missing something. Let me try to address some points.

(partially) solves the schema problem as every image could be associated with a schema.

Well it's already being done in the reverse sense -- each cult-cargo cab in the release already has a specific cab: image: version entry. And recipes don't directly deal with images -- they specify cab definitions -- so I don't think an explicit image->cab link is necessary.

there is no need to expose the schema parameter to the average user,

Which schema parameter do you mean? The average user just works with an overall cult-cargo release version, which, in turn, implies frozen versions of all constituent packages under the hood (where the average user need not look).

less opaque i.e. at present there is an implication that using an image is the latest version in cult-cargo (when using cult-cargo images)

I don't think this is the implication, but I also think we mean different things by "latest", are you thinking of it as a mutable, continuously updating version? There is no such thing once a given release of cult-cargo is out. There is simply a default image version (which does have a specific well-defined number). It is "latest" only in the sense of "latest at time of this specific cult-cargo release". Once a cult-cargo release is out, the associated images don't change anymore.

The only time we change images is during a cult-cargo prerelease process. I.e. 0.1.3 is the next version -- I'll keep pushing new images with that tag until 0.1.3 is released. The cult-cargo build script already has protections for this, it will refuse to push images for a known release.

I think that it is completely acceptable to bail out if a user attempts to use a parameter which is not part of the associated schema, and being too clever on this point may lead to future pain.

Agreed. I was merely suggesting a friendlier message when bailing out ("unsupported parameter because you have version blah" as opposed to just "unknown parameter").
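
A small sketch of the two messages, assuming deactivated parameters are kept as stubs that record their `versions` specifier. `check_params`, the dict layouts, and the parameter names are hypothetical.

```python
def check_params(params, active, stubs, cab_name, cab_version):
    """Collect one error message per parameter that the cab cannot accept."""
    errors = []
    for name in params:
        if name in active:
            continue
        if name in stubs:
            # Known parameter, but not available in this version of the cab.
            errors.append(
                f"{cab_name}: parameter '{name}' requires version "
                f"{stubs[name]['versions']}, but this cab is version {cab_version}")
        else:
            errors.append(f"{cab_name}: unknown parameter '{name}'")
    return errors


active = {"size": {"dtype": "int"}}
stubs = {"new-option": {"versions": ">=3.1"}}
user_params = {"size": 1024, "new-option": True, "typo": 1}
errors = check_params(user_params, active, stubs, "wsclean", "3.0")
for message in errors:
    print(message)
# wsclean: parameter 'new-option' requires version >=3.1, but this cab is version 3.0
# wsclean: unknown parameter 'typo'
```

Either way validation still bails out; only the wording differs.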

As an aside, this also means that new images could be added without requiring a package release of cult-cargo

This is already the case somewhat. As soon as we push 0.1.3pre1, we are free to push and push 0.1.3 images, until we hit the release button on 0.1.3 proper (see above). Also, a dev version of something doesn't even need to use cult-cargo. I could push breifast images to my own personal repo and keep shipping dev cabs pointing to them, all the way until breifast makes it into cult-cargo.
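
The prerelease convention described here could be expressed roughly as below. This is not the actual cult-cargo build script, only a sketch that assumes pre-release package versions look like `0.1.3pre1` and that the set of released versions is known to the caller. `image_tag` and `check_push_allowed` are hypothetical names.

```python
import re


def image_tag(package_version):
    """Strip a trailing pre-release marker: '0.1.3pre1' -> '0.1.3'."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:pre\d+)?", package_version)
    if match is None:
        raise ValueError(f"unexpected version string: {package_version!r}")
    return match.group(1)


def check_push_allowed(package_version, released_versions):
    """Refuse to overwrite images belonging to an already released version."""
    tag = image_tag(package_version)
    if tag in released_versions:
        raise RuntimeError(f"{tag} is already released; its images are frozen")
    return tag


released = {"0.1.1", "0.1.2"}
print(check_push_allowed("0.1.3pre1", released))  # 0.1.3
print(check_push_allowed("0.1.3pre2", released))  # 0.1.3
```

Both pre-releases map to the same mutable `0.1.3` tag, which stays pushable until `0.1.3` is added to the released set.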

could consider a stimela publish command that does the following:

Good idea (and touches on the certifiable workflows discussion), so let's break this out into a separate issue/discussion.

Partial support for github-based steps - not a fully formed idea as yet

I was thinking of something similar in #115 (in a pure venv context), but indeed this could also be done with images.

@JSKenyon
Collaborator

JSKenyon commented Jun 3, 2024

I'm a bit reticent about adding top-heavy structures... the current scheme is simple, and relies on standard repositories (PyPI and quay.io) where all versions persist. It's also easy to replicate for somebody who wants to maintain their own cult-cargo-like collection. I also think all information required for full reproducibility is already in there, unless I'm missing something. Let me try to address some points.

I don't think abandoning either of PyPI or quay.io would be required. I think that having a layer on top of them that maps schema (in the sense of cab definitions i.e. the parameters the cab accepts) may just be helpful in the long run.

Well it's already being done in the reverse sense -- each cult-cargo cab in the release already has a specific cab: image: version entry. And recipes don't directly deal with images -- they specify cab definitions -- so I don't think an explicit image->cab link is necessary.

Agreed, although I really do stand by my opinion that using cult-cargo as version control is going to bite us. It means we are vulnerable to changes in Python versions and may place limits on how long after the fact a result remains reproducible. The hypothetical scenario in my head is as follows: I run a recipe today with cult-cargo==1.0.0 on Python 3.9.18. In five years (yes, this is quite a long time but bear with me), someone else wants to reproduce that result on Python 3.14.8. There was no cult-cargo release for that version of Python and let us assume for the sake of argument that the user cannot install an older Python (which may be true). There is now no easy way to reproduce the result, despite all the images still being available.

Which schema parameter do you mean? The average user just works with an overall cult-cargo release version, which, in turn, implies frozen versions of all constituent packages under the hood (where the average user need not look).

I think the use of schema confused this point - apologies. What I mean is that, for packages included in cult-cargo, if we have this extra layer which knows which cabs (i.e. which parameter schemas) map to which version, the user need never worry about this. This achieves the same result as the current approach, but without requiring a specific version of cult-cargo in order for the cab definition to be correct.

I don't think this is the implication, but I also think we mean different things by "latest", are you thinking of it as a mutable, continuously updating version? There is no such thing once a given release of cult-cargo is out. There is simply a default image version (which does have a specific well-defined number). It is "latest" only in the sense of "latest at time of this specific cult-cargo release". Once a cult-cargo release is out, the associated images don't change anymore.

I understand this point but I maintain that this is opaque. It means that if a user were to read the recipe, there would be absolutely no way of knowing which versions were in use without either checking other files (requiring more expert knowledge) or installing a specific version of cult-cargo, which brings us back to my earlier point.

The only time we change images is during a cult-cargo prerelease process. I.e. 0.1.3 is the next version -- I'll keep pushing new images with that tag until 0.1.3 is released. The cult-cargo build script already has protections for this, it will refuse to push images for a known release.

Ok, this is something I hadn't thought about. That is fair enough. I will point out that in the current model, cult-cargo may end up releasing many versions very quickly if the goal is to make upstream packages available rapidly (which I think is the case).

This is already the case somewhat. As soon as we push 0.1.3pre1, we are free to push and push 0.1.3 images, until we hit the release button on 0.1.3 proper (see above). Also, a dev version of something doesn't even need to use cult-cargo. I could push breifast images to my own personal repo and keep shipping dev cabs pointing to them, all the way until breifast makes it into cult-cargo.

On the point about private repos, absolutely. I have done so too. What I meant by this point is that we could push a hypothetical quartical:1.0.0, quartical:1.0.1 and quartical:1.1.0 all without ever requiring a cult-cargo release if we used my proposed (completely theoretical at this point) approach. I think that this could eventually spare us pain and potentially speed up the process for getting new versions into the hands of users.

Finally, just to reiterate, if the goal is reproducibility, I sincerely believe we have to decouple the "runtime" requirements i.e. the cab definitions and images, from the Python code/packaging infrastructure.

@o-smirnov
Member Author

There was no cult-cargo release for that version of Python and let us assume for the sake of argument that the user cannot install an older Python (which may be true).

Fair point. This is where the PyPI model breaks. Still, I like the simplicity of it for now, so maybe we can muddle our way forward to a more structured scheme while we retain backwards compatibility?

It means that if a user were to read the recipe, there would be absolutely no way of knowing which versions were in use

Arguably this is a good thing. The top-level recipe should not be burdened by details, it's more readable that way. For those that want to get into the versioning weeds, there is the stimela publish idea. So if you (literally) publish a result, you provide the recipe as a top-level recipe, and the stimela publish outputs as supplementary material (which is required to reproduce).

cult-cargo may end up releasing many versions very quickly if the goal is to make upstream packages available rapidly (which I think is the case).

The model I followed for 0.1.2 was multiple 0.1.2preX releases of cult-cargo, while the images themselves were versioned 0.1.2 and were being updated. Do you think this works going forward? Bleeding edge people can use the pre-releases and/or track cult-cargo master. At some point we make a proper release, images get frozen, and another pre-release cycle starts.

@JSKenyon
Collaborator

JSKenyon commented Jun 3, 2024

Fair point. This is where the PyPI model breaks. Still, I like the simplicity of it for now, so maybe we can muddle our way forward to a more structured scheme while we retain backwards compatibility?

Yeah - any changes weren't going to be short term regardless. Just something to keep at the back of our heads.

Arguably this is a good thing. The top-level recipe should not be burdened by details, it's more readable that way. For those that want to get into the versioning weeds, there is the stimela publish idea. So if you (literally) publish a result, you provide the recipe as a top-level recipe, and the stimela publish outputs as supplementary material (which is required to reproduce).

Agreed. So the policy is that versions float with the cult-cargo version until such time as you freeze them in with publish.

The model I followed for 0.1.2 was multiple 0.1.2preX releases of cult-cargo, while the images themselves were versioned 0.1.2 and were being updated. Do you think this works going forward? Bleeding edge people can use the pre-releases and/or track cult-cargo master. At some point we make a proper release, images get frozen, and another pre-release cycle starts.

Ok, that works.

3 participants