[Question] How to identify the latest pin in an s3_board #790

Closed
jeffkeller87 opened this issue Sep 23, 2023 · 7 comments

@jeffkeller87

jeffkeller87 commented Sep 23, 2023

A follow-up to this post and similar to #590.

I love how I can use {pins} instead of maintaining my own artifact management process. It really cuts down on the amount of boilerplate code in my projects!

However, I often have a need to read pins from a system where installing either the R or Python {pins} package is not possible. In my case, these systems are ephemeral continuous integration runners with a limited set of software installed. Specifically, I am grabbing the latest model artifacts from S3 to COPY into a Docker image.

My current solution is to write artifacts to a latest/ prefix (or directory, if S3 is not the storage medium) in addition to a timestamped prefix. If the storage medium is a filesystem, I sym- or hard-link latest/ to the appropriate timestamped directory. The structure looks something like this:

repo
└── artifact
    ├── 2023-09-22T11:41:35Z
    ├── 2023-09-23T11:42:53Z
    └── latest

From a system without {pins}, I can then reference a static path to get the latest artifacts.

aws s3 sync s3://repo/artifact/latest .
cp -R .../repo/artifact/latest .
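
For the filesystem case, the layout above could be maintained with something like the following. This is only a sketch: the `publish` function and its arguments are hypothetical names, and `ln -sfn` assumes a GNU or BSD `ln` (the `-n` flag is not in POSIX).

```shell
#!/bin/sh
# Sketch: copy an artifact directory into a timestamped directory and
# repoint a "latest" symlink at it. publish() is a hypothetical helper.
publish() {
  # $1 = directory holding the new artifacts, $2 = artifact root
  stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  mkdir -p "$2/$stamp"
  cp -R "$1/." "$2/$stamp/"
  # Relative target, so the link survives if the whole tree is moved.
  ln -sfn "$stamp" "$2/latest"
  echo "$stamp"
}

# Demo against throwaway directories.
src=$(mktemp -d)
root=$(mktemp -d)
echo "weights" > "$src/model.bin"
stamp=$(publish "$src" "$root")
cat "$root/latest/model.bin"
```

A reader without {pins} then only needs the static `latest` path, at the cost of the writer keeping the link up to date.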

Without {pins}, is there a straightforward way to identify the latest pin version in a board?

@juliasilge
Member

Thanks for this question @jeffkeller87! I think the short answer is "no" because we haven't built either R or Python pins with an eye toward being used directly from a shell or similar, but since it's all just files and directories, certainly you have options:

  • The versions for an S3 board (or GCS, or Azure, etc) are a timestamp pasted together with a truncated hash, so they will sort in the correct order. You can use that sensible naming scheme to get the latest version with something like ls | tail -1.
  • Have you checked out the new-ish manifest file functionality? This lets you be more explicit in which versions you want to track and use. You can read more in this vignette, and I could see reading that YAML file with something like yq to find the latest version, then opening up that directory.
  • I think your current approach of copying what you want into a /latest directory is great too! Maybe it is the most straightforward for your situation.

I don't think it's likely that pins starts keeping a /latest directory since that's not the main use case we're targeting, but certainly you can use the directory structure (maybe together with a manifest file) to manage this from a shell in a couple of ways.
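
The sort-by-name idea can be sketched roughly as follows. It assumes version names of the form `<UTC timestamp>-<short hash>`, one per line; the `latest_version` function, the pin name `my-model`, and the hashes are made up for illustration.

```shell
#!/bin/sh
# Sketch: pick the newest pin version by name, without the pins package.
# Assumes names like 20230926T161828Z-a1b2c, one per line on stdin.
latest_version() {
  # Drop any trailing slash, then sort bytewise so the fixed-width
  # timestamp prefix decides the order; the last line is the newest.
  sed 's:/*$::' | LC_ALL=C sort | tail -n 1
}

# Local demo with a hypothetical layout: <board>/<pin>/<version>/
board=$(mktemp -d)
mkdir -p "$board/my-model/20230922T114135Z-0a1b2" \
         "$board/my-model/20230923T114253Z-9f8e7"
latest=$(ls "$board/my-model" | latest_version)
echo "$latest"
```

On S3 the same function could probably be fed from `aws s3 ls` on the pin's prefix, after extracting the names from its `PRE <name>/` lines; that part is untested here.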

@jeffkeller87
Author

Thank you @juliasilge for the very thoughtful response. I agree that replicating the /latest copy / link within pins probably isn't the right thing to do. However, if there is room to improve the interop surface of pins, I think that would be worth pursuing.

To that end, is the naming scheme sufficient for determining the latest version? I figured the truncated hash would cause issues if more than one pin version was written within the same second. That's probably good enough for what I'm doing, but I can see it causing issues in other scenarios. Do you have a strong preference for the hash over sub-second markers?

The manifest file was the other option I considered. My optimism deflated a bit when I saw it was a YAML file rather than JSON--only because of how long it took me to convince my Infrastructure / IT people to install jq in our runner image. Theoretically, I could get them to install yq too :)
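
If installing yq is not an option, a plain `awk` pass might be enough for a simple manifest. The sketch below assumes the manifest is a flat YAML mapping from pin name to an unquoted list of version paths, as in the sample written by the demo; the real file may differ, so treat both the layout and the hypothetical `latest_from_manifest` function as assumptions. It is not a YAML parser.

```shell
#!/bin/sh
# Sketch: find the newest version path for one pin in a manifest file.
# ASSUMED layout: "pin-name:" followed by "- pin-name/<version>/" items.
latest_from_manifest() {
  # $1 = pin name, $2 = path to the manifest file
  awk -v pin="$1" '
    $0 == (pin ":")    { inpin = 1; next }  # start of this pin
    inpin && /^[^ -]/  { inpin = 0 }        # next top-level key ends it
    inpin && /^[ ]*- / { sub(/^[ ]*- */, ""); sub(/\/$/, ""); print }
  ' "$2" | LC_ALL=C sort | tail -n 1
}

# Demo with a made-up manifest.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
my-model:
- my-model/20230922T114135Z-0a1b2/
- my-model/20230923T114253Z-9f8e7/
other-pin:
- other-pin/20240101T000000Z-abcde/
EOF
latest_from_manifest my-model "$manifest"
```

Quoted values, nested keys, or comments would need a real YAML tool.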

@juliasilge
Member

Oh yes, you are definitely correct that the timestamp doesn't distinguish between versions written within the same second. This has come up before and to date, the only time this has been a problem is in kind of "fake" situations, like when building a vignette or when people are writing tests in other packages that use pins. We haven't heard of problems with the timestamp in people's real work, since most folks are pinning, say, a model binary or a summarized dataset coming out of an ETL pipeline. Folks are generally not using pins for super high performance writing, at least so far.

In your use case, would subsecond information be practically important?

## what we do now:
format(Sys.time(), "%Y%m%dT%H%M%SZ", tz = "UTC")
#> [1] "20230926T161828Z"

## we could do something like:
format(Sys.time(), "%Y%m%dT%H%M%OS2Z", tz = "UTC")
#> [1] "20230926T161828.26Z"

Created on 2023-09-26 with reprex v2.0.2

@jeffkeller87
Author

In my case, there should be no chance of a sub-second temporal collision like that. But there are always those unexpected scenarios where another writer sneaks in at just the wrong time, and then you're pulling your hair out figuring out why the pin you just wrote isn't the one that gets read immediately afterward (using the ls | tail method).

Modifying the timestamp format would shrink the probability further, but it makes specifying an explicit version more onerous in pin_read().

I think the current behavior is fine as-is. If someone is writing this frequently intentionally, they probably don't want a versioned board anyway.
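
To make the same-second case concrete, here is a small sketch with two made-up version names that share a timestamp. When the timestamps tie, the sort falls through to the hash, so the name that sorts last need not be the one written last.

```shell
#!/bin/sh
# Two hypothetical versions written within the same second.
# Suppose f3a1c was written first and 0b7d2 second.
first_written=20230926T161828Z-f3a1c
second_written=20230926T161828Z-0b7d2

# The "ls | tail" approach: sort by name and take the last entry.
picked=$(printf '%s\n' "$first_written" "$second_written" \
  | LC_ALL=C sort | tail -n 1)
echo "$picked"
```

Here the sort picks `20230926T161828Z-f3a1c`, the earlier write, because `f` sorts after `0`.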

@juliasilge
Member

That makes a lot of sense. I'm going to leave this issue open for discussion in case other folks come by with this same need in the near future; we can reevaluate as we hear more on it. Thanks again for the question @jeffkeller87!

@juliasilge
Member

It sounds like we haven't seen a high need for improvements in this area so I am going to close this issue. We can revisit in the future if we hear more from users on this! 🙌

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 14, 2024