Questions regarding volume lifetimes vs on-disk storage #16245

Closed
alexlarsson opened this issue Oct 21, 2022 · 32 comments
Comments

@alexlarsson
Contributor

So, this is not necessarily an issue, but something that came up in work/discussions around transient container storage and may be problematic. At least it requires some thought.

I've been looking at making /var/lib/containers/storage a tmpfs for transient containers in the automotive use case. This is nice, because we always create containers from scratch and run them with --rm --read-only, so there is no real state anyway. This way there is no risk of any leftover garbled state after a boot, and there are some important performance advantages.

However, we still might want to persist volumes in some way, because that is the way to persist data in this approach. The idea I had was to just change volume_path to point outside of /var/lib/containers/storage, which would persist the volume data. However, the container metadata is stored in the bolt db, so that gets lost over a reboot.

I tried to just re-create the container on boot, and that works great, as it picks up the existing directory for the volume (because it has the same name). However, @vrothberg is a bit worried about this picking up of an existing directory, because it may lose some important metadata about the volume (like the driver, etc.). So, the question is: should the volume create really succeed if there was already an existing backing store for it, or should it fail? And if it should fail, is there some other way we can achieve our goal, like maybe putting the volume bolt db in the volume_path rather than in the storage directory?
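For illustration, this is roughly the setup I mean; the paths, volume name and image are just examples, not a tested configuration:

  # /etc/containers/containers.conf - keep named volume data off the tmpfs-backed storage
  [engine]
  volume_path = "/var/lib/containers/persistent-volumes"

  # at boot: the volume's record in the bolt db is gone, but its _data directory is not
  podman volume create mydata          # picks up the existing directory because the name matches
  podman run --rm --read-only -v mydata:/data registry.example.com/myapp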

@vrothberg
Member

@baude @rhatdan @mheon PTAL

@mheon
Member

mheon commented Oct 21, 2022

Volume driver shouldn't matter here; nor should volume options. A driver means that the mountpoint of the volume is not managed by Podman but by the driver, so volume_path won't matter because an external program is managing the volume. If non-default options are chosen, Podman will use mount to mount over the volume's directory, in which case it not being empty doesn't matter; we're mounting our own content over it. In brief, any critical volume config already overrides the storage, so I see no reason we need to change the current behavior.

That said, there is also no solid reason to keep the current behavior. Using existing directories isn't documented anywhere, and it's more of an accident of how we wrote things than anything else. If you see another reason why this breaks things, I have no objection to changing this. podman volume import and podman volume export should cover any use-cases that manually modifying the volume's backing directory would (albeit in a somewhat slower and less convenient fashion).
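For reference, that export/import round trip would look something like this (volume and file names are made up):

  podman volume export mydata --output mydata.tar   # save the volume's contents as a tarball
  podman volume create mydata                       # later: create a fresh, empty volume
  podman volume import mydata mydata.tar            # restore the contents into it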

@alexlarsson
Contributor Author

I don't, and I actually prefer this behaviour, in that it allows me to reinitialize the transient volume (with the same options) and get the previously written content back. It works great with the "transient storage" model, although there is a slight fear here about relying on undocumented behaviour.

@vrothberg
Member

Yes, I am mostly afraid of relying on undocumented behavior and a probably unintended use case.

Metadata (e.g., labels) is stored in libpod's DB which will be nuked in the described use case. There is "works" and "works as intended" and I think we should be on the latter side.

If the volume data were stored under --volumepath, we wouldn't run into this issue. Not sure about the impacts, though.

@alexlarsson
Contributor Author

Here is the WIP code to use this:
https://gitlab.com/CentOS/automotive/sample-images/-/merge_requests/160
With this containers-transient-store rpm:
https://gitlab.com/CentOS/automotive/rpms/containers-transient-store
That just adds a tmpfs .mount unit on top of /var/lib/containers/storage
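For context, such a unit is essentially just a tmpfs mounted over the storage directory; a minimal sketch (not the actual unit shipped in that RPM, and the size is illustrative) would be:

  # /etc/systemd/system/var-lib-containers-storage.mount
  # (systemd requires the unit name to match the mount path)
  [Unit]
  Description=Transient container storage on tmpfs

  [Mount]
  What=tmpfs
  Where=/var/lib/containers/storage
  Type=tmpfs
  Options=rw,size=50%

  [Install]
  WantedBy=local-fs.target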

@alexlarsson
Contributor Author

Ok, so what about this approach:

We add a new run option --transient that marks a container as transient. For any container that is marked transient, we store the corresponding overlay-containers and container-layers data in runroot instead of root. In addition, we add a separate bolt db in runroot (i.e. /run/containers/storage/libpod/bolt_state.db) where we store transient containers.

For volumes and images we keep everything in the regular root as before, so these persist.

Then libpod will need to merge the two bolt databases when e.g. listing or looking up running processes.

This would give us faster start times for transient containers (no syncs when writing the bolt db or the layers.json/containers.json files), as well as zero chance of transient containers causing weird db issues on any kind of unclean shutdown.

@alexlarsson
Contributor Author

Maybe this can even use the "volatile" key that we already have, although one main difference is that with this the actual upper overlayfs layers would be on the tmpfs.

Maybe we can split that too, and have the layers.json/containers.json in runroot but keep the backing overlayfs storage on the regular root?

@vrothberg
Member

@containers/podman-maintainers WDYT?

@Luap99
Member

Luap99 commented Oct 26, 2022

Then libpod will need to merge the two bolt databases when e.g. listing or looking up running processes.

I think this makes performance worse, having to check two databases will take more time. Sounds also like a horror to maintain; how can we guarantee a consistent state when we have two databases?


Getting back to your original question: Could you just use regular bind mounts -v /path:/path instead of named volumes? In this case there is no volume metadata.
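That is, something along these lines (paths, image and the :Z relabel option are placeholders):

  podman run --rm --read-only -v /persistent/appdata:/data:Z registry.example.com/myapp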

@vrothberg
Member

vrothberg commented Oct 26, 2022

IMO the volumes database belongs on --volumepath. Otherwise, the flag does not seem to be all too useful. At the very least, we should document the current behavior and state that changing --volumepath can lead to issues.

I think this makes performance worse, having to check two databases will take more time.

That really depends on the implementation. It would only slow things down if Podman currently read all the data at once, but I am pretty sure we're doing subsequent reads. Caching and locking may play a role, but this can be optimized in clever ways.

Sounds also like a horror to maintain; how can we guarantee a consistent state when we have two databases?

Locking will make sure that the tools function properly but we are defenseless against forcefully removing data. Is there a real difference to other cases such as secrets, images, containers, layers, etc?

@alexlarsson
Contributor Author

I think this makes performance worse, having to check two databases will take more time.

In actual measurements I've not really seen loading from the db as any noticeable cost, whereas the writes (which this could avoid) are extremely costly. This is particularly true since the added database would be on tmpfs anyway.

Sounds also like a horror to maintain; how can we guarantee a consistent state when we have two databases?

There won't really be any conflicts. One db will contain all the "volatile" containers, and the other the rest. We always know at creation time where to save a container.

Getting back to your original question: Could you just use regular bind mounts -v /path:/path instead of named volumes? In this case there is no volume metadata.

Could I? Yes, that would be a workaround to the problem. However, I'm trying to create a product here, not a workaround. And named volumes are an important part of the container ecosystem. For instance, if you want to use this from podman play kube, that uses PersistentVolumeClaims, which imply named volumes. Not being able to use them is problematic and limits the usefulness of volatile containers.
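To make that concrete, a kube YAML along these lines (names and image are purely illustrative) ends up needing a named volume for the claim when played with podman play kube:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: appdata
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: app
  spec:
    containers:
    - name: app
      image: registry.example.com/myapp
      volumeMounts:
      - name: appdata
        mountPath: /data
    volumes:
    - name: appdata
      persistentVolumeClaim:
        claimName: appdata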

@Luap99
Member

Luap99 commented Oct 26, 2022

IMO the volumes database belongs on --volumepath. Otherwise, the flag does not seem to be all too useful. At the very least, we should document the current behavior and state that changing --volumepath can lead to issues.

I think this makes performance worse, having to check two databases will take more time.

That really depends on the implementation. It would only slow things down if Podman currently read all the data at once, but I am pretty sure we're doing subsequent reads. Caching and locking may play a role, but this can be optimized in clever ways.

Sorry, I was referring to "In addition, we add a separate bolt db in runroot (i.e. /run/containers/storage/libpod/bolt_state.db) where we store transient containers."

I agree that it would make sense to have the volume data stored inside --volumepath.


Sounds also like a horror to maintain; how can we guarantee a consistent state when we have two databases?

There won't really be any conflicts. One db will contain all the "volatile" containers, and the other the rest. We always know at creation time where to save a container.

Sure, we know where to store it, but we still need to check that there are no containers with the given name or ID; we cannot have any collisions. If you do things like podman start/stop/inspect/rm/etc., we now always need to look up in both dbs.


I think this makes performance worse, having to check two databases will take more time.

In actual measurements I've not really seen loading from the db as any noticeable cost, whereas the writes (which this could avoid) are extremely costly. This is particularly true since the added database would be on tmpfs anyway.

If writes are the problem then we should move the container state to a tmpfs db. This is really the only part with many writes during all podman operations.

@alexlarsson
Contributor Author

If writes are the problem then we should move the container state to a tmpfs db. This is really the only part with many writes during all podman operations.

That would lose the container state on a reboot, though? That will not work for containers that aren't run with podman run --rm.

@Luap99
Member

Luap99 commented Oct 26, 2022

AFAIK most of the container state should be temporary anyway; it is already cleared on boot (first podman command) here:

func (c *Container) refresh() error {

Looks like some fields are not reset, so they might have to stay, but we could investigate this further to see what is really needed if it is a performance issue. This would help every workflow and not just one.

@alexlarsson
Contributor Author

Some container state is temporary, like obviously the part stored in runroot. But you definitely can podman create foo, reboot, and still have a container named foo.

@Luap99
Member

Luap99 commented Oct 26, 2022

That is not in the container state; that is part of the container config. Check the ContainerState struct in libpod.

@alexlarsson
Contributor Author

Ok, I was not that specific with my reference to "state"; I just meant state in general.

But, to be more precise, on my machine during a "podman run" these are the boltdb write ops I see:
BoltState.addContainer - 11.58 msec
BoltState.SaveContainer - 3.85 msec (saved before fully setup)
BoltState.SaveContainer - 3.27 msec (fully setup)
BoltState.SaveContainer - 3.12 msec (started)
BoltState.SaveContainer - 2.85 msec (exited)

That is 25 msec, and a large part of the overhead.
In addition, I see:

AtomicWriteFile storage/overlay-layers/layers.json - 13.76 msec
AtomicWriteFile storage/overlay-layers/layers.json - 8.89 msec
AtomicWriteFile storage/overlay-containers/containers.json - 5.57 msec
AtomicWriteFile storage/overlay-layers/layers.json - 8.27 msec

Which is another 36 msec.

The exact time of this depends on how large the files are, but generally the actual cost is not from the writes but from the fsyncs. These are the operations I hope to avoid in my case, where none of these writes need to be persisted across boot, yet almost half the time to start a container is spent ensuring on-disk robustness. I want --rm containers to be transient (as they are not meant to survive a reboot anyway) to avoid this.

It's possible that we could get some of this by moving the state to a tmpfs database, but as long as "podman run" writes something that could potentially need to survive a reboot, I don't think that can fully avoid this.

@giuseppe
Member

IMO the volumes database belongs on --volumepath. Otherwise, the flag does not seem to be all too useful. At the very least, we should document the current behavior and state that changing --volumepath can lead to issues.

I agree, I think this is what we should do: move the volume data out of the main bolt db. With that in place we have the freedom of moving the volume data to persistent storage while the rest stays under the graphroot.

@mheon
Member

mheon commented Oct 27, 2022

I missed a lot here, so catching up:

--volumepath is not a regular-use flag. It's intended to be used similarly to flags like --runroot and --tmpdir - set it once, the first time you start up Podman, and it's cached in the DB forever after. We cannot separate the volume DB from the container DB without consistency issues, either - what happens if a container references a volume that isn't in the volume DB? It's a lot easier to be consistent when everything is in one DB.

What is the perceived benefit of splitting the data out? Is this just around not losing volume data when the main db (on tmpfs) gets cleared? I don't really see that as much of a problem. Is there something I'm missing here?

@mheon
Member

mheon commented Oct 27, 2022

I suppose we could also consider a volume driver (not the plugin kind, something like the image driver I implemented recently) that would source a volume from a given path. A recreate on reboot would still be required, but we could easily persist metadata alongside the volume (store a text file alongside the volume directory with relevant metadata). Something like:

podman volume create --driver external --opt path=/my/persistent/directory persistentVol

This would be a hell of a lot easier to code up than a completely separate volume database.

@alexlarsson
Contributor Author

Ok, so it seems we are relying on using a single bolt db to get transactional behaviour - for example, that you can't remove a volume that is being used, or that when deleting a container it is removed from the volumes it uses. That implies that we can't really split up the bolt database, and the only realistic alternative is to have a completely transient bolt db (i.e. on tmpfs). For example, we could use a storage.conf option to store the bolt db in runroot instead. In this case we would have to re-create all volumes as needed (using podman volume create, or implicitly) and rely on this picking up the old volume content. This sounds fine to me.

For the containers.json and layers.json files, though, we don't have the same issues. In fact, we already support multiple stores, for example layerstore & ro-layerstore. I can easily see extending this with a volatile-layerstore. We would already know from the container config where a container should be stored, allowing us to update the right json file.

So, what I propose is:

  • Add global option to put bolt-db on runroot.
  • Declare that "podman volume create named-volume" will pick up pre-existing volume data in that path (as it already does; just make it documented)
  • Add a new volatile-container-store on runroot and put data about --rm (i.e. c.config.Volatile) containers in it
  • Add a new volatile-layer-store on runroot and put data about the container layers for volatile containers in it.

Optionally we could store a json file in e.g. /var/lib/containers/storage/volumes/myvolume/ next to the _data directory, which could be used to pre-populate the bolt db on boot if it's stored on runroot.
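Such a sidecar file would only need roughly the fields that podman volume inspect reports today; purely as an illustration (field values made up, not an actual format):

  {
    "Name": "myvolume",
    "Driver": "local",
    "Labels": { "app": "example" },
    "Options": {},
    "CreatedAt": "2022-10-27T10:00:00Z"
  }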

@rhatdan
Member

rhatdan commented Oct 27, 2022

Where is the layer store stored for a --rm container in your strategy? If it is on a tmpfs then I could run out of space rather quickly unless the container is --read-only?

@rhatdan
Member

rhatdan commented Oct 27, 2022

It might be worth scheduling a meeting to fully discuss this.

@alexlarsson
Contributor Author

I think the layer store could be persistent, but perhaps not mixed with the persistent ones. So, maybe /var/lib/containers/storage/overlay-containers/ and /var/lib/containers/storage/overlay-containers-volatile, where the latter is cleaned on boot.

@alexlarsson
Contributor Author

Basically you would have something like:

Persistent containers (like now):

  • /var/lib/containers/storage/overlay-containers/containers.json - list of all persistent containers
  • /var/lib/containers/storage/overlay-containers/$ID/ - persistent data
  • /run/containers/storage/overlay-containers/$ID/ - transient data

Volatile containers:

  • /run/containers/storage/overlay-containers-volatile/containers.json - list of all volatile containers
  • /var/lib/containers/storage/overlay-containers-volatile/$ID/ - persistent data
  • /run/containers/storage/overlay-containers-volatile/$ID/ - transient data

Where we clean out /var/lib/containers/storage/overlay-containers-volatile/ on boot.

Actually, the above isn't quite right, as overlay-containers only contains userdata; what I really mean is that the persistent part would be the overlay upper directory, but I still think you understand what I mean.

@alexlarsson
Contributor Author

It might be worth scheduling a meeting to fully discuss this.

We have the auto meeting later today.

@alexlarsson
Contributor Author

I guess just clearing out the directory at boot is not really compatible with a bolt db that persists, so we would also have to remove all volatile containers from the db at boot.

@alexlarsson
Contributor Author

So, from the discussion in the meeting today, I think the proposal can be simplified a bit:

Add a new global storage.conf option that makes all containers transient. In this mode we store the whole bolt db, as well as containers.json and the container-related layers in layers.json, in runroot (i.e. on tmpfs). We then mention in the volume create docs that it will pick up pre-existing volume directories if they exist at creation time.
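As a sketch, such a knob could look something like this in storage.conf (the option name here is illustrative of the proposal, not a committed interface):

  # /etc/containers/storage.conf
  [storage]
  driver = "overlay"
  runroot = "/run/containers/storage"
  graphroot = "/var/lib/containers/storage"
  # proposed: keep the container db, containers.json and the container layers on runroot (tmpfs)
  transient_store = true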

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Nov 28, 2022

@alexlarsson Are we all set on this now?

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@alexlarsson
Contributor Author

This should be ok now

@github-actions github-actions bot added the "locked - please file new issue/PR" label Sep 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 6, 2023