
container-remove event does not set exit code #19124

Closed
martinpitt opened this issue Jul 5, 2023 · 18 comments · Fixed by #20609
Labels
kind/feature · locked - please file new issue/PR · stale-issue

Comments

@martinpitt (Contributor)

Feature request description

cockpit-podman is plagued with lots of race conditions and flaky tests. I have investigated many of them, but the remaining ones are due to a fundamental issue with the monitoring API.

The UI uses the libpod/events API, which notifies about high-level actions such as start or died, for example:

{"status":"start","id":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","from":"localhost/test-busybox:latest","Type":"container","Action":"start","Actor":{"ID":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","Attributes":{"containerExitCode":"0","image":"localhost/test-busybox:latest","name":"swamped-crate","podId":""}},"scope":"local","time":1688533611,"timeNano":1688533611933738591}

However, this event does not contain most of the properties that the UI needs to show, so in reaction to such events, the UI does a containers/json query for that container:

{"method":"GET","path":"/v1.12/libpod/containers/json","body":"","params":{"all":true,"filters":"{\"id\":[\"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2\"]}"}}

which then responds with all the info that the UI needs:

[{"AutoRemove":false,"Command":["sh","-c","echo 123; sleep infinity"],"Created":"2023-07-05T05:06:41.355776664Z","CreatedAt":"","Exited":false,"ExitedAt":1688533611,"ExitCode":0,"Id":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","Image":"localhost/test-busybox:latest","ImageID":"24ac8b76cfb0440579ade1908a8a765d3c8a62bd366058cf84e1a7d6754ee585","IsInfra":false,"Labels":null,"Mounts":[],"Names":["swamped-crate"],"Namespaces":{},"Networks":[],"Pid":19176,"Pod":"","PodName":"","Ports":null,"Size":null,"StartedAt":1688533611,"State":"running","Status":""}]

The problem is that this is racy: the /containers/json call is necessarily asynchronous, and when events come in bursts, the calls overlap, and podman's replies do not come back in the same order as the requests. This is a log capture from a part of a test that performs a few container operations, such as stopping and restarting a container. I stripped out all the JSON data for clarity; the important bit is the ordering:

> debug: podman user call 44:
> debug: podman user call 45:
> debug: podman user call 45 result:
> debug: podman user call 46:
> debug: podman user call 44 result:
> debug: podman user call 47:
> debug: podman user call 46 result:
> debug: podman user call 47 result:
> debug: podman user call 48:
> debug: podman user call 43 result:
> debug: podman user call 49:
> debug: podman user call 49 result:
> debug: podman user call 50:
> debug: podman user call 48 result:
> debug: podman user call 51:
> debug: podman user call 50 result
> debug: podman user call 52:
> debug: podman user call 52 result:
> debug: podman user call 51 result:

So if the container moves from "Running" → "Exited" → "Stopped" → "Restarting" → "Running", a jumbled response order can lead to swaps, and the final state reported in the UI is e.g. "Restarting" or "Exited". The latter happened in this run, where the screenshot says "Exited", but podman ps says "Up" (i.e. "Running"), as can be seen in the "----- user containers -----" dump in the log.
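
To illustrate one way a client could guard against this, here is a minimal sketch: tag each refresh with a per-container sequence number and drop any reply that arrives after a newer request was already issued. The helper names (inspectContainer, updateUiState) are hypothetical stand-ins, not the actual cockpit-podman code:

```typescript
// Hypothetical helpers standing in for the real cockpit-podman code:
declare function inspectContainer(id: string): Promise<unknown>;
declare function updateUiState(id: string, info: unknown): void;

// Track the newest refresh request issued per container.
const latestRequest = new Map<string, number>();

async function refreshContainer(id: string): Promise<void> {
    const seq = (latestRequest.get(id) ?? 0) + 1;
    latestRequest.set(id, seq);

    const info = await inspectContainer(id); // wraps /libpod/containers/json

    // If a newer refresh was issued while we were waiting, this reply is
    // stale; applying it would overwrite fresher state (the bug above).
    if (latestRequest.get(id) !== seq)
        return;
    updateUiState(id, info);
}
```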

Suggest potential solution

My preferred solution would be to avoid having to call /containers/json after a "start" or "rename" event in the first place. That only leads to additional API traffic, and thus more computational overhead on both the podman and the UI side, and is prone to these kinds of race conditions. D-Bus services like systemd or udisks generally solved this with the PropertiesChanged signal, i.e. a notification with the set of changed properties is emitted each time something changes. These are naturally ordered correctly, and the watcher can apply them incrementally to always have an accurate model of the state without having to do extra "get" calls.

For the podman API, this cannot just be squeezed into the existing start (or remove, etc.) events, as the container properties can change more often, and also independently of the coarse-grained lifecycle events.

Perhaps podman could introduce a new event type changed that fires whenever any property changes and delivers the /containers/json info for the container(s) that changed. Both "your" (podman) and "my" (cockpit-podman) sides already have the code to generate/parse this information, so it would only require some minor plumbing changes.

If this is expensive, it may also be adequate to require listeners to explicitly opt into these notifications, although connecting to /events generally already means that the listener wants this kind of information.

Have you considered any alternatives?

It may also be possible to change podman to not reply to requests out of order. I don't know how easy or hard that is with Go and goroutines. I do know that it is very hard on the JavaScript client side to reorder the replies.

It might be easier on our side to completely serialize all API calls, but that would make the UI very slow, especially with many containers. The containers are independent of each other, so serializing calls across containers is not conceptually necessary.
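
As an illustration of the middle ground, a per-container serialization sketch: chain each container's API calls on a promise, so replies for the same container apply in request order while different containers still proceed in parallel. These names are hypothetical, not actual cockpit-podman code:

```typescript
// Any async API call concerning one container.
type Task = () => Promise<void>;

// One promise chain per container ID.
const chains = new Map<string, Promise<void>>();

function enqueue(id: string, task: Task): Promise<void> {
    const next = (chains.get(id) ?? Promise.resolve())
        .catch(() => undefined)  // a failed call must not stall the chain
        .then(task);
    chains.set(id, next);
    return next;
}

// usage: enqueue(containerId, () => refreshContainer(containerId));
```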

Additional context

No response

martinpitt added the kind/feature label Jul 5, 2023
@martinpitt (Contributor, Author)

@jelly FYI

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 5, 2023
Don't start pod and container state updates at the same time. They do
asynchronous state updates which sometimes step on each other's toes and
reverse the real event order.

This mitigates the problem with getting podman API call results in the
wrong order (containers/podman#19124);
that causes containers to sometimes appear in a non-current state.
martinpitt added a commit to cockpit-project/cockpit-podman that referenced this issue Jul 5, 2023
Don't start pod and container state updates at the same time. They do
asynchronous state updates which sometimes step on each other's toes and
reverse the real event order.

This mitigates the problem with getting podman API call results in the
wrong order (containers/podman#19124);
that causes containers to sometimes appear in a non-current state.
@martinpitt (Contributor, Author)

I just talked to @vrothberg, many thanks! Notes:

podman has a mode for logging more information in events, but it's opt-in (containers.conf), so nothing that c-podman can rely on.

But it should be possible to greatly improve this with the existing API:

  • Only do a /json inspect for the create event, to get image, mounts, pod, etc. -- this information won't change
  • predict the future state for events like 'died' or 'start'; the start time is contained in the 'start' event
  • ignore all transient state events like 'restart' and 'mount'; they will always be followed by an event for a persistent state (like 'died' or 'start')
  • also ignore unrelated events like 'prune'

With that, there should be by and large only one /json inspect instead of umpteen, which should mitigate the problem to the point where it no longer matters.
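
For illustration, a rough sketch of that filtering; the event actions are the ones named above, while the handler names (refreshContainer, applyPredictedState) and the event shape are hypothetical stand-ins:

```typescript
// Hypothetical handlers, standing in for the real cockpit-podman code:
declare function refreshContainer(id: string): Promise<void>;
declare function applyPredictedState(event: ContainerEvent): void;

interface ContainerEvent {
    Action: string;
    Actor: { ID: string };
}

// Transient/unrelated actions named in the notes above; skip them entirely.
const TRANSIENT_ACTIONS = new Set(["restart", "mount", "prune"]);

function handleContainerEvent(event: ContainerEvent): void {
    if (TRANSIENT_ACTIONS.has(event.Action))
        return; // a persistent-state event ("start", "died", ...) follows

    if (event.Action === "create") {
        // the single /containers/json inspect: image, mounts, pod, etc.
        void refreshContainer(event.Actor.ID);
        return;
    }
    // for everything else, predict the new state from the event itself
    applyPredictedState(event);
}
```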

I'll try that in the next few days (well, some PTO ahead) and report back here. In the best case, I'll just close this.

Thanks!

@vrothberg (Member)

@martinpitt and I had a quick call to discuss the issue. We concluded that there is a way to significantly reduce the number of API calls emitted by Cockpit.

At the time of writing, Cockpit inspects a container on (too) many container events. As mentioned in the issue description above, the inspections are done to get the current state of the container. We can reduce this to a single inspect by doing the following:

  • On a create event we know that a new container is there. To get all the metadata, we need to inspect it. That is the one call we need.
  • For all other events, no inspect is needed.
  • To display state (transitions) in the UI, Cockpit can interpret incoming events (see the sketch after this list):
    • create -> Created
    • init -> Initialized
    • start -> Running
    • died -> Exited
    • remove -> Removed (stop displaying the container in the UI)
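
For illustration, the mapping above as a small lookup table, with remove handled separately since it drops the container from the UI; a hypothetical sketch, not actual cockpit-podman code:

```typescript
// Event action -> state string displayed in the UI.
const ACTION_TO_STATE: Record<string, string> = {
    create: "Created",
    init: "Initialized",
    start: "Running",
    died: "Exited",
};

// Returns undefined for "remove": stop displaying the container instead.
function displayStateFor(action: string): string | undefined {
    if (action === "remove")
        return undefined;
    return ACTION_TO_STATE[action];
}
```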

When Cockpit starts, all containers must be listed/inspected once to get all the metadata and state.

Thanks for your summary, @martinpitt :) I think we have a shared understanding.

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 5, 2023
These are internal transient states which don't need to be reflected in the
UI. They happen quickly in bursts, with a "permanent state" event such as
"create", "died", or "remove" following. This helps to reduce the API calls
and thus mitigates out-of-order results; see
containers/podman#19124
@Luap99 (Member) commented Jul 5, 2023


Can we close the issue? It sounds like we do not need podman changes.

@vrothberg (Member)

I'll leave that to @martinpitt. I expect the issue to be resolved by the proposed idea, but I am OK with leaving it open until we're sure.

@martinpitt (Contributor, Author)

I'm inching closer in cockpit-project/cockpit-podman#1324; so far it's looking good. I should finish this by tomorrow, and will then most probably close this, unless something missing turns up.

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 5, 2023
These events change the state in a predictable way. Skip the expensive
updateContainer() call for them, so that multiple calls don't overlap
each other. See containers/podman#19124
@martinpitt (Contributor, Author)

* On a `create` event we know that a new container is there.  To get all the metadata, we need to inspect it.  That is the 1 call we need.
* For all other events, no inspect is needed.

Agreed, that would be nice. However, it also means that any runtime-related property then needs to be part of the event info.

  * `start` -> Running

This is missing the Pid and StartedAt properties. We can make do without Pid, and we can probably estimate StartedAt from the event's time property. So that still seems okay. (I haven't put this into practice yet; I'm still rearranging the event handling -- that requires some serialization with the /containers/json request.)

  * `died` -> Exited

This is missing the ExitCode and Error properties, which cannot be inferred. FinishedAt etc. can again be estimated from the event's time.
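
For illustration, a sketch of estimating these timestamps from the event itself; the time and Action field names match the libpod event JSON shown earlier, while the container-state shape is a hypothetical stand-in:

```typescript
// "Action" and "time" (unix seconds) appear in the libpod event JSON above.
interface LibpodEvent { Action: string; time: number; }

// Hypothetical client-side model of the container's runtime state.
interface ContainerState { StartedAt?: number; FinishedAt?: number; }

function applyTimesFromEvent(state: ContainerState, event: LibpodEvent): void {
    if (event.Action === "start")
        state.StartedAt = event.time;  // estimate, in lieu of an inspect call
    else if (event.Action === "died")
        state.FinishedAt = event.time; // ExitCode still cannot be inferred
}
```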

BTW, since @vrothberg mentioned log flooding: every single event repeats the Attributes map -- these are properties of the image and will never change. The only exception is podId, which feels like it really shouldn't be in that image attribute list: it's a runtime property, not an image one -- and there's also the possibility of a namespace collision. That is a separate issue; I filed it as #19143.

@vrothberg (Member)

This is missing the Pid and StartedAt properties. We can make do without Pid, and we can probably estimate StartedAt from the event's time property. So that still seems okay.

Yes, that will work.

This is missing the ExitCode and Error properties, which cannot be inferred.

Interesting! Should we make the exit code part of the attributes? @Luap99 WDYT?

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 6, 2023
With a burst of events, these get called in parallel. But podman does not
return the replies in call order [1], which led to non-current state updates.

[1] containers/podman#19124
@martinpitt (Contributor, Author)

Should we make the exit code part of the attributes

Just to avoid a misunderstanding: please not literally .Attributes (see #19143), but adding it to the attributes of the event JSON message would be nice. Thanks!

@Luap99 (Member) commented Jul 6, 2023

Adding the exit code seems useful in general, but why are you interested in the PID? That should not add any useful value; by the time you read the PID, it may already have been reused by a different process (well, theoretically at least).

Also, what even is Error, and why should it be added only to died?

@martinpitt (Contributor, Author)

why are you interested in the PID?

Not particularly -- it's just something that we happen to use in our tests as a reliable indicator that a restart command actually finished, but it's something we can easily replace (with the timestamp).

what even is Error

I can't say, I'm afraid. It's part of the /containers/json properties. If it can happen for other events, it should surely be added to those as well.

martinpitt changed the title from “Race-free monitoring of containers property changes” to “Add "ExitCode" property to events” Jul 6, 2023
@martinpitt (Contributor, Author)

Retitling / narrowing the scope of this to ExitCode; it seems everything else discussed here has a solution or workaround. Thank you!

martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 13, 2023
These are internal transient states which don't need to be reflected in the
UI. They happen quickly in bursts, with a "permanent state" event such as
"create", "died", or "remove" following. This helps to reduce the API calls
and thus mitigates out-of-order results; see
containers/podman#19124

Also fix the alphabetical sorting of the remaining events.
@martinpitt (Contributor, Author)

Never mind, the exit code is already part of the Attributes, at least for the died event:

{"status":"died","id":"319da37bf7e6ee1b078b7d4e348710a67083b10b7ed7cf7e61e166bda92ab074","from":"docker.io/library/debian:sid","Type":"container","Action":"died","Actor":{"ID":"319da37bf7e6ee1b078b7d4e348710a67083b10b7ed7cf7e61e166bda92ab074","Attributes":{"containerExitCode":"3","image":"docker.io/library/debian:sid","name":"admiring_bhabha","podId":""}},"scope":"local","time":1689235736,"timeNano":1689235736846535953}

It doesn't stick around, though -- the next event is usually "remove", and then it gets reset to 0:

{"status":"remove","id":"319da37bf7e6ee1b078b7d4e348710a67083b10b7ed7cf7e61e166bda92ab074","from":"docker.io/library/debian:sid","Type":"container","Action":"remove","Actor":{"ID":"319da37bf7e6ee1b078b7d4e348710a67083b10b7ed7cf7e61e166bda92ab074","Attributes":{"containerExitCode":"0","image":"docker.io/library/debian:sid","name":"admiring_bhabha","podId":""}},"scope":"local","time":1689235736,"timeNano":1689235736905768245}

But good enough, I suppose.
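
For illustration, a client could capture the exit code at the died event and ignore the reset value that the subsequent remove event carries; the attribute names match the event JSON above, everything else is a hypothetical sketch:

```typescript
// Shape of the relevant fields in the "died"/"remove" event JSON above.
interface ContainerLifecycleEvent {
    Action: string;
    Actor: { ID: string; Attributes: { containerExitCode?: string } };
}

// Exit codes recorded at "died", keyed by container ID.
const exitCodes = new Map<string, number>();

function recordExitCode(event: ContainerLifecycleEvent): void {
    if (event.Action === "died") {
        const code = event.Actor.Attributes.containerExitCode;
        if (code !== undefined)
            exitCodes.set(event.Actor.ID, parseInt(code, 10));
    }
    // on "remove", keep the value recorded at "died" and ignore the
    // reset containerExitCode of "0" that the remove event carries
}
```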

martinpitt changed the title from “Add "ExitCode" property to events” to “Race-free monitoring of containers property changes” Jul 13, 2023
@vrothberg (Member)

I guess so

@github-actions (bot)

A friendly reminder that this issue had no activity for 30 days.

cgiradkar added a commit to cgiradkar/podman that referenced this issue Nov 28, 2023
…om int to int ptr

Added an additional check for the event type being remove, and set the correct exit code.
It was getting difficult to maintain the omitempty notation for Event->ContainerExitCode; changing the type from int to an int pointer gives us the ability to check that ContainerExitCode is not nil and continue operations from there.

closes containers#19124

Signed-off-by: Chetan Giradkar <[email protected]>
github-actions bot added the locked - please file new issue/PR label Feb 27, 2024
github-actions bot locked as resolved and limited conversation to collaborators Feb 27, 2024