Support transient storage mode #1424
Conversation
Force-pushed from e8c4187 to 4353843.
@saschagrunert PTAL
High-level:
Overall this feels like quite a lot of complexity, and it seems fairly likely to be a maintenance headache:
- Users must manually clean up half of the storage on reboot
- Where is the lock file stored, in the persistent or non-persistent data?
- IIRC in general, creating cross-store layers is unexpected; AFAIK that only works for overlay.
- How are we going to keep this working? Do we need a whole new third mode for running all tests?
As a thought experiment, what would happen if
- there were no separate transient mode
- the “fast `podman run`” case was instead using a fresh graphroot in a temporary location, not even necessarily tmpfs, pointing at the long-term layer+image store using the existing “additional stores” mechanism?
- there were some option to just turn off the `fdatasync`, so that all the metadata I/O has negligible latency?

I think the major difference would be `podman run` triggering an image pull, because the image would be pulled into the temporary layer store, not the primary one. Is that a problem?
Individual review comments are under the assumption that the high-level concerns are overcome or irrelevant — and probably not exhaustive.
containers.go
Outdated
```go
if r.transient {
	return filepath.Join(r.rundir, "containers.json")
}
return filepath.Join(r.dir, "containers.json")
```
Locally, it would be simpler and more efficient to replace this method with a `containersJSONPath` field set in the constructor instead of recomputing it all the time. Then we wouldn’t even need the `rundir` and `transient` fields.

Applies similarly to the layer store.
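A minimal sketch of what that could look like (the constructor signature is hypothetical, not the actual c/storage code): the path is decided once, so the store no longer needs the `rundir` and `transient` fields just to rebuild it on every call.

```go
// Illustrative sketch only; field and constructor names are assumptions.
package main

import (
	"fmt"
	"path/filepath"
)

type containerStore struct {
	dir                string
	containersJSONPath string // computed once, replaces rundir/transient for this purpose
}

// newContainerStore decides the containers.json location up front.
func newContainerStore(dir, rundir string, transient bool) *containerStore {
	jsonPath := filepath.Join(dir, "containers.json")
	if transient {
		jsonPath = filepath.Join(rundir, "containers.json")
	}
	return &containerStore{dir: dir, containersJSONPath: jsonPath}
}

func (r *containerStore) containersPath() string {
	return r.containersJSONPath
}

func main() {
	r := newContainerStore("/var/lib/containers/storage/overlay-containers",
		"/run/containers/storage/overlay-containers", true)
	fmt.Println(r.containersPath()) // /run/containers/storage/overlay-containers/containers.json
}
```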
containers.go
Outdated
```diff
@@ -120,6 +120,8 @@ type rwContainerStore interface {
 type containerStore struct {
 	lockfile  Locker
 	dir       string
+	rundir    string
+	transient bool
```
Absolutely non-blocking: `runDir` is more idiomatic Go. The existing fields don’t conform; maybe we can just stop following that precedent.
store.go
Outdated
```go
	return ErrNotALayer
}

if l, err := lstore.Get(id); err != nil {
```
An unrelated six-year-old bug: This must have been intended to be err == nil
I see my comment went to the wrong place, pasting here too:
I don't think this is actually a real problem. The intent is to ship podman with some systemd unit file that just does this for you. Then all you need to worry about is ordering that before starting any systemd services using podman.
It is important to get this right (i.e use the same in all places), and I think it makes sense to have it on the transient location as persistent locks have no meaning. However, I don't see any higher-level issue with this?
This however seems like a real issue. I'm pretty sure vfs is easy to make work, but for devmapper and btrfs this is not really possible. We could make it possible for the driver to expose a flag for supporting this or not, and then we could do the splitting of the layers store optionally.
That seems like the "easiest" way, but I haven't yet looked at the storage testsuite. Are long test runtimes a problem for c/s currently?
What would generally happen is that you disconnect the new container from all the global podman state. Like, unless you jump through hoops you can't: get logs for it, see it in podman ps, use podman health-checks, use named volumes, share namespaces with other containers, etc, etc. I'm also unsure how well podman handles the "read-only additional store" changing over time, although I see read locks for it, so that may be ok. (I guess some of these issues are due to the related boltdb transient store changes and not necessarily the storage changes, but we want a complete solution so they need to be considered together.)

The initial idea for transient mode was actually similar to this, i.e. just point rootdir and runroot somewhere on tmpfs, or mount a tmpfs on rootdir. Having a "no fsync" mode for storage is quite similar to this.

The core issue here is that there is a mix of things in storage, some that we want to persist (backing store for image layers, image db, backing store for volumes, etc), and some we don't. The databases mix information about these, so any update has to be fully robust, even when just changing information about transient stuff. Splitting this up is really where the complexity of this PR comes from.

I don't really see a way to avoid the splitting though, because fundamentally we do want to have some persisted state, and we do want to use that global state with the transient state. However, maybe we can split it in a different way, for example with some kind of shared persistent store plus a separate volatile one. I'll have a look at the two-json-files approach and see how it would look.
For our particular usecase, I don't think this is a problem, because we expect the install of an image to be separate from the running of it, though that depends on the user.
I guess a disclaimer: I’ve been digging into the code recently, but I don’t really have a long-term understanding of the concerns and trade-offs.
(That wouldn’t work well for unprivileged users in “rootless” mode, OTOH they probably wouldn’t enable the transient mode.)
I don’t have a good understanding of how this works, but IIRC the current cross-store support for additional stores is quite ad-hoc: some parts are implemented at the high level of the store, some down in the individual graph drivers. Now that could be an argument for cleaning up and generalizing, as well as an argument for trying to minimize use of complexity like this…
My first guess is that the tests could be run in parallel, at an extra monetary cost.
Right, the BoltDB database (or just some parts of it?) would also need to be moved to tmpfs...
IIRC it’s expected to work, at least we suggest usage where the additional store is centrally-managed, which basically requires it to be robust against changes. (Removing a layer from the additional store can always break the primary one, until an image is (deleted and?) re-pulled.)
That’s true. I guess where I struggle is whether this absolutely needs to be an additional mode. Maybe it does, because we need some updates to the usually-persistent state anyway, and that’s where the actual cost is?

Or, uh, should Podman actually be much smarter, and automatically use transient stores, with non-durable writes, for `--rm` containers? That idea looks so tempting and almost “cute” that I’m almost sure there’s a good reason it couldn’t work… and of course it would mean extending the current two-layer-store approach to the container store (and the BoltDB data)?
I really don’t know enough to predict how this would end up; it might well be much simpler. (BTW I don’t think having two JSON files inherently requires them to be recorded in the same directory; they could use the proposed two-directory layout probably just as easily.)

What’s the core reason for the performance difference? Is almost all of it the `fdatasync`?
Actually, the cleanup issue needs a bit more thinking in general. The layers.json file, as well as the config.json file, does reference the actual backing store for the driver, which in the overlay backend lives in the persistent `overlay/` directory, so leftover data there still needs to be cleaned up.
Yes, this is important, and already done in the podman pr (containers/podman#16371)
This is problematic at least for the bolt db. We rely on the transactions it uses for atomicity and verification of cross-table links (like which containers use which volume, etc). We can't split the db file in two files and still get ACID behaviour. This is why this was made a global mode. But I agree with you that it would be cool if it could be done automatically per container.
I'm currently looking at this, and it seems to end up as a much smaller change, so it seems a better approach. Will push new version later today.
The performance is basically all in the fdatasync(). Although in the automotive usecase we are also quite interested in the more robust behaviour that comes with starting from a fresh set of databases on boot, because that limits the risk of weird shutdown issues making the car not start up when booted again.
Thanks, that’s interesting to know.
Could you just use $RUNDIR/overlay-containers for transient containers by default, since only small stuff is stored there?
Force-pushed from 4353843 to 8ea8205.
Ok, new version. Rather than having two stores, we now just have one container store and one (rw) layer store. However, both stores save two json files, the regular one and a "volatile" one. The volatile one is saved using NoSync, meaning it is faster to write, but any write to it risks losing all the data in it (not just the modification written), so we don't want to store any persistent info in it. In addition, when in transient mode, the volatile files are stored in the rundir.

In normal mode, volatile containers are those started with `--rm`. In transient mode, all containers and all layers for containers are volatile.

This brings us a performance improvement for regular podman run --rm containers:

orig: Benchmark 1: bin/podman.orig run --rm --pull=never --network=host --security-opt seccomp=unconfined fedora true
new: Benchmark 1: bin/podman run --rm --pull=never --network=host --security-opt seccomp=unconfined fedora true

And an even better improvement (includes e.g. transient bolt db) in transient mode:

Benchmark 1: bin/podman run --transient-store=true --rm --pull=never --network=host --security-opt seccomp=unconfined fedora true
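A rough sketch of the two-file scheme described above, under stated assumptions: the helper names (`Container.Volatile`, `saveFile`, `save`) are illustrative rather than the PR's actual API, but the split matches the description — the stable file pays for a sync before the rename, the volatile file skips it, and in transient mode the volatile file moves to the runroot.

```go
// Illustrative sketch only; not the c/storage implementation.
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

type Container struct {
	ID       string `json:"id"`
	Volatile bool   `json:"volatile,omitempty"` // e.g. a `podman run --rm` container
}

// saveFile writes data via a temp file and rename; only when sync is true does
// it pay for the expensive fsync (the "NoSync" case skips it).
func saveFile(path string, data []byte, sync bool) error {
	tmp := path + ".tmp"
	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if sync {
		if err := f.Sync(); err != nil {
			f.Close()
			return err
		}
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// save splits the containers between the regular and the volatile json file.
func save(graphDir, runDir string, transientStore bool, containers []Container) error {
	var stable, volatile []Container
	for _, c := range containers {
		if transientStore || c.Volatile {
			volatile = append(volatile, c)
		} else {
			stable = append(stable, c)
		}
	}
	volatileDir := graphDir
	if transientStore {
		volatileDir = runDir // tmpfs-backed runroot: nothing survives a reboot
	}
	stableData, err := json.Marshal(stable)
	if err != nil {
		return err
	}
	volatileData, err := json.Marshal(volatile)
	if err != nil {
		return err
	}
	if err := saveFile(filepath.Join(graphDir, "containers.json"), stableData, true); err != nil {
		return err
	}
	return saveFile(filepath.Join(volatileDir, "volatile-containers.json"), volatileData, false)
}

func main() {
	_ = save("/var/lib/containers/storage/overlay-containers",
		"/run/containers/storage/overlay-containers", false,
		[]Container{{ID: "aaa"}, {ID: "bbb", Volatile: true}})
}
```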
I resolved a bunch of outdated comments.
The last version is much nicer, LGTM. I've some comments on the podman integration though. I think the setting in storage must always be honored if set (unless it is explicitly overridden).
Just a fairly quick skim for now.

This looks very nice!

The `containers.go` review comments mostly apply to `layers.go` equally.
```go
	return err
for _, location := range containerLocations {
	rpath := r.containerspath(location)
	if err := os.MkdirAll(filepath.Dir(rpath), 0700); err != nil {
```
This was actually done by `newContainerStore` already, so maybe we can avoid that syscall or two on every write. Alternatively, at least move it below the `locationModified` check.
@alexlarsson WDYT ^^
I dunno. All the various stores seem to do this, there might be some weird reason for it. If we do this, that should be in its own PR imho.
Force-pushed from 8ea8205 to a7f15ce.
New version again; this time the locations are treated as flags instead of an enum, so we can just pass the set of all locations to write to save(), and we can drop the whole "mark as dirty" thing. It's not quite what @mtrmac said, but I think this is clean while at the same time keeping the old public Save() method around with unchanged behaviour for compat.
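A minimal sketch of the locations-as-flags idea (constant and function names are assumptions, not necessarily the PR's): each location is one bit, so callers can pass any subset to a single save path, and the public Save() can simply pass the set of all locations.

```go
// Illustrative sketch only.
package main

import "fmt"

// Each save location is one bit, so a set of locations is just an OR of flags.
type containerLocations uint8

const (
	stableContainerLocation containerLocations = 1 << iota
	volatileContainerLocation

	allContainerLocations = stableContainerLocation | volatileContainerLocation
)

// save writes only the json files whose location bit is set.
func save(locations containerLocations) {
	if locations&stableContainerLocation != 0 {
		fmt.Println("writing containers.json (synced)")
	}
	if locations&volatileContainerLocation != 0 {
		fmt.Println("writing volatile-containers.json (NoSync)")
	}
}

func main() {
	save(volatileContainerLocation) // e.g. only a --rm container changed
	save(allContainerLocations)     // e.g. the public Save(): write everything
}
```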
Force-pushed from a7f15ce to bb3c90f.
Now dropped the Save() calls.
Force-pushed from bb3c90f to 0a42390.
Hmmm, it feels wrong to have layers.lock in rundir while the non-transient layers.json is not in rundir. What if something uses this directory as an additional store? Seems safer to have the lock next to the layers.json. For the containers case it should be fine though.
Force-pushed from 0a42390 to 4a7f517.
Great point, yes. (And if this intentionally differs between containers and layers, please add a comment documenting this rationale to emphasize the two should stay different.)
Force-pushed from 4a7f517 to 012ba21.
So, I'm pretty happy with how this is now. The only thing remaining imho is to add some kind of operation that cleans up the store for leftovers from transient containers. This shouldn't really be that hard: for the containers, look for data dirs whose ids are not in the containers files and remove them; for the layers, have the driver list the layers it has on disk and remove the ones not referenced by the layers json files. To avoid races I think you need to do these steps in a particular order.

We can make this optional and just implement it for overlay and vfs initially; see the sketch below.
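A very rough sketch of the comparison step only, with hypothetical names (this is not the PR's API, and it deliberately leaves out the locking and the actual removal):

```go
package main

import "fmt"

// unreferencedIDs returns the ids that exist on disk (as reported by a
// directory scan or by a driver method listing its layers) but are not
// recorded in the corresponding json file. These are the leftovers that a
// boot-time garbage collection would remove.
func unreferencedIDs(knownIDs map[string]struct{}, onDiskIDs []string) []string {
	var leftovers []string
	for _, id := range onDiskIDs {
		if _, ok := knownIDs[id]; !ok {
			leftovers = append(leftovers, id)
		}
	}
	return leftovers
}

func main() {
	// ids currently listed in layers.json / volatile-layers.json
	known := map[string]struct{}{"layer-a": {}}
	// ids the driver reports as present on persistent storage
	onDisk := []string{"layer-a", "layer-b"} // layer-b is a leftover transient layer
	fmt.Println(unreferencedIDs(known, onDisk)) // [layer-b]
}
```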
needs a rebase
Force-pushed from bd8effb to f16e34d.
LGTM, but there are a couple of @mtrmac questions remaining. @nalind @mtrmac @vrothberg @saschagrunert @flouthoc @umohnani8 PTAL
@rhatdan most of the questions were stale, I cleared those. The remaining ones I think need to be a separate PR.
Lemme rebase also. |
This is to allow easy use of the NoSync option. Signed-off-by: Alexander Larsson <[email protected]>
This just adds the support for setting and handling this key, it doesn't really do anything yet. Signed-off-by: Alexander Larsson <[email protected]>
This splits up the containers.json file in the containers store into two, adding the new file `volatile-containers.json`. This new file is saved using the NoSync option, which is faster but isn't robust in the case of an unclean shutdown. In the standard case, only containers marked as "volatile" (i.e. those started with `--rm`) are stored in the volatile json file. This means such containers are faster, but may get lost in case of an unclean shutdown. This is fine for these containers though, as they are not meant to persist. In the transient store case, all containers are stored in the volatile json file, and it (plus the matching lock file) is stored on runroot (i.e. tmpfs) instead of the regular directory. This means all containers are fast to write, and none are persisted across boot. Signed-off-by: Alexander Larsson <[email protected]>
We store information about the layers in two files, layers.json and volatile-layers.json. This allows us to treat the two stores differently, for example saving the volatile json with the NoSync option, which is faster (but less robust). In normal mode we store in the volatile file only the layers of containers that are marked volatile (`--rm`), as these are not expected to persist anyway. This way information about such containers is faster to save, and in case of an unclean shutdown we only risk loss of information about other such volatile containers. In transient mode all container layers (but not image layers) are stored in the volatile json, and additionally it is stored in tmpfs. This improves performance as well as automatically making sure no container layers are persisted across boots. Signed-off-by: Alexander Larsson <[email protected]>
This looks in the container store for existing data dirs with ids not in the container files and removes them. It also adds an (optional) driver method to list available layers, then uses this and compares it to the layers json file and removes layers that are not referenced. Losing track of containers and layers can potentially happen in the case of some kind of unclean shutdown, but mainly it happens at reboot when using transient storage mode. Such users are recommended to run a garbage collect at boot. Signed-off-by: Alexander Larsson <[email protected]>
If STORAGE_TRANSIENT is set to 1 then transient mode will be enabled in the tests. Also adds overlay-transient to the cirrus CI. Signed-off-by: Alexander Larsson <[email protected]>
Force-pushed from 96a93c7 to d6cb12d.
LGTM
This is the containers/storage part of containers/podman#16371
It adds a new global storage mode option called 'transient_store'. If this is true then the primary databases for container-specific data (i.e. not image data) are made transient, by placing the storage in runroot (i.e. on tmpfs).

In particular, for c/s this implies:

- `.../storage/overlay-containers` is renamed `.../storage/overlay-transientcontainers`, and its `containers.json` is stored in the runroot.
- There is a `.../storage/overlay-transientlayers/` that is similar to `.../storage/overlay-layers/`, but is only used for the container layers (not images), and that has the `layers.json` file in runroot.

This completely erases the time spent in fdatasync() calls for the json files written during `podman run`.

Note that this only moves the "database files" to runroot. The actual storage etc is still on persistent storage (to avoid filling memory). On a fresh boot we will always get empty databases, so the old files will not be referenced. This means no old state will get in the way and mess things up in case of an unclean shutdown. However, the space is not reclaimed, so this should ideally be paired with code in the podman packaging to clear these directories on boot. For this reason the transient directories have different names, so they are easy to clean up.