
Resource-leaked logs-collector-XXXX containers #1833

Closed
mieubrisse opened this issue Nov 19, 2023 · 1 comment · Fixed by #1870
Labels: bug, cli, critical

Comments

@mieubrisse (Collaborator) commented Nov 19, 2023

What's your CLI version?

0.85.29

Description & steps to reproduce

My local Docker-backed Kurtosis is empty:

$ kurtosis enclave ls
UUID   Name   Status   Creation Time

kurtosis clean -a doesn't clean anything, as expected.

However, there are two zombie logs-collector containers hanging around that don't go away:

[Screenshot: the two leaked logs-collector containers]

These do not correspond to any enclave networks, as there are none (as expected):

$ docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
8662e507c238   bridge    bridge    local
deab0a0cb95e   host      host      local
a8c59496b516   none      null      local

More details in #1832

Desired behavior

Kurtosis should always fully clean up after itself.

The original clean logic was written like so:

  1. Use the presence/absence of -a to decide which enclaves ought to remain (if -a is specified, no resources should remain).
  2. After finding the enclaves-to-remain, find and kill all Kurtosis resources that aren't in that set, in the proper order (containers -> volumes -> networks). We deliberately DON'T say "look for enclave networks, and then remove the resources associated with them", because if an enclave network is removed without all of its resources being removed (e.g. because a failure happened), we permanently leak those resources.

In other words, we first grab ALL Kurtosis resources, default them all to "destroy this resource", and then remove the ones that should be preserved. This guarantees that we process every single resource Kurtosis can create.

This algorithm is very important, because it guarantees that even if an enclave is in an inconsistent state (e.g. an error happened, or maybe the user messed with Docker), all the resources for the enclave can still get fully cleaned up.
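
A minimal sketch of this set-difference approach, with hypothetical names (`container`, `dockerRemoveContainer`, `cleanContainers`); the real logic lives in container-engine-lib and also covers volumes and networks:

```go
package main

import "fmt"

type container struct {
	id          string
	enclaveUUID string
}

// Stub standing in for the real Docker removal call.
func dockerRemoveContainer(id string) {
	fmt.Println("destroyed container", id)
}

// cleanContainers defaults EVERY Kurtosis container to "destroy", then spares
// only those belonging to enclaves that should remain. Half-created or
// inconsistent enclaves therefore still get cleaned, because nothing here
// depends on the enclave (or its network) being intact.
func cleanContainers(all []container, enclavesToKeep map[string]bool) {
	for _, c := range all {
		if enclavesToKeep[c.enclaveUUID] {
			continue // resource belongs to an enclave that should remain
		}
		dockerRemoveContainer(c.id)
	}
}

func main() {
	all := []container{
		{id: "logs-collector-1234", enclaveUUID: "half-created-enclave"},
		{id: "service-abcd", enclaveUUID: "keep-me"},
	}
	// `kurtosis clean -a` passes an empty keep-set: everything is destroyed.
	cleanContainers(all, map[string]bool{})
}
```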

I suspect that the logs-collector cleanup has not been written this way, and is instead written as, "first try to remove the enclave network and all the containers in it, then try to remove the logs-collector". This means that if a failure happens (e.g. because of Ctrl-C), the enclave network is lost but the logs-collector hasn't been cleaned up, as sketched below.
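
For contrast, a sketch of that fragile ordering (function names are hypothetical stubs, not the real API): because the logs collector is only reachable through its enclave network, losing the network first leaks it permanently.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stubs standing in for the real Docker calls.
func removeNetworkAndItsContainers(networkID string) error {
	fmt.Println("removed network", networkID)
	return errors.New("simulated Ctrl-C mid-clean")
}

func removeLogsCollectorFor(networkID string) error {
	fmt.Println("removed logs collector for", networkID)
	return nil
}

// fragileClean removes the enclave network first and only afterwards the
// logs collector. If the first step fails partway, the network may already
// be gone while the logs collector remains; since the collector is looked
// up VIA the network, it can never be cleaned again.
func fragileClean(networkID string) error {
	if err := removeNetworkAndItsContainers(networkID); err != nil {
		return err // the logs-collector removal below is never reached
	}
	return removeLogsCollectorFor(networkID)
}

func main() {
	if err := fragileClean("enclave-net-1234"); err != nil {
		fmt.Println("clean failed:", err)
	}
}
```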

What is the severity of this bug?

Critical; I am blocked and Kurtosis is unusable for me because of this bug.

What area of the product does this pertain to?

CLI: the Command Line Interface

@mieubrisse mieubrisse added the bug Something isn't working label Nov 19, 2023
@github-actions github-actions bot added cli For bugs relating to the CLI critical Critical bug or feature labels Nov 19, 2023
@tedim52 (Contributor) commented Nov 22, 2023

Confirming that the clean logic works as you described it should (at least on the surface). I verified that the log collector is one of the containers removed in the `destroyContainersInEnclaves` step, which runs before volumes and then the enclave networks are destroyed.

Going to continue debugging what may be causing the leaked logs collectors.

github-merge-queue bot pushed a commit that referenced this issue Nov 30, 2023
## Description:
Fixes: #1833

The resource leak was caused by a state where, if a failure occurred while creating an enclave (e.g. Ctrl-C while running `kurtosis
enclave add`) at a point where the log collector container had been created BEFORE it was connected to the network AND before the necessary
`defer undo`s to clean up the log collector had been queued, THEN the network would get cleaned up by the `defer undo` from `CreateNetwork`,
but the log collector container would remain.

Any attempt to do a `kurtosis clean -a` would then fail to clean the log collector container, because the network (and thus the enclave) the
container was created for had already been cleaned up.
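
A sketch of that failure window, in the `defer undo` style the description refers to (all names here are hypothetical, not the real codebase's):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stubs. createLogsCollectorContainer returns an ID alongside
// an error to model "the container exists, but the call was interrupted".
func createNetwork() (string, error) { return "enclave-net", nil }
func deleteNetwork(id string)        { fmt.Println("undo: deleted network", id) }
func createLogsCollectorContainer() (string, error) {
	return "logs-collector-1234", errors.New("simulated Ctrl-C")
}
func deleteContainer(id string) { fmt.Println("undo: deleted container", id) }

func createEnclaveOld() error {
	networkID, err := createNetwork()
	if err != nil {
		return err
	}
	undoNetwork := true
	defer func() {
		if undoNetwork {
			deleteNetwork(networkID) // network undo is queued here
		}
	}()

	// The container can already exist at this point even though its own
	// undo has NOT been queued yet; a failure in this window runs the
	// network undo above but leaves the container behind.
	collectorID, err := createLogsCollectorContainer()
	if err != nil {
		return err
	}
	undoCollector := true
	defer func() {
		if undoCollector {
			deleteContainer(collectorID)
		}
	}()

	// ... connect the collector to the network, etc. ...
	undoNetwork = false
	undoCollector = false
	return nil
}

func main() {
	if err := createEnclaveOld(); err != nil {
		fmt.Println("enclave creation failed:", err)
	}
}
```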

After digging, I realized the log collector was being created in
`container-engine-lib` as opposed to in the `engine`. Moving the logic to
the engine, AFTER the enclave is created, fixes the issue. If there is an
error at ANY point in the creation of the log collector container (even
if the log collector's `defer undo` hasn't been queued yet), the `defer undo`
from the `CreateEnclave` call will clean up the log collector, because it
has a label with the `enclaveUUID`.
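
A sketch of the fixed ordering under the same assumptions (names hypothetical): the enclave's own undo removes everything labeled with the enclave UUID, so a half-created collector is covered even before its own undo exists.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stubs.
func createEnclave(uuid string) error { return nil }
func destroyEverythingLabeled(uuid string) {
	fmt.Println("undo: removed all resources labeled enclaveUUID=" + uuid)
}
func createLogsCollector(uuid string) error {
	// The container is created WITH an enclaveUUID label, then fails.
	return errors.New("simulated Ctrl-C")
}

func createEnclaveFixed(enclaveUUID string) error {
	if err := createEnclave(enclaveUUID); err != nil {
		return err
	}
	undoEnclave := true
	defer func() {
		if undoEnclave {
			// Removes ALL resources carrying this enclave's label,
			// including a half-created logs collector.
			destroyEverythingLabeled(enclaveUUID)
		}
	}()

	// Created AFTER the enclave and labeled with enclaveUUID, so the
	// enclave undo above covers a failure at any point in here.
	if err := createLogsCollector(enclaveUUID); err != nil {
		return err
	}

	undoEnclave = false
	return nil
}

func main() {
	if err := createEnclaveFixed("a1b2c3"); err != nil {
		fmt.Println("enclave add failed:", err)
	}
}
```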

## Is this change user facing?
NO

## References:
#1833