Watch stops working after 4 days and receives duplicate events on delete pod #1239
-
Describe the bug
Because of the issue we faced earlier, where the connection was being closed due to inactivity, we put the watcher in an infinite loop so that the connection is re-established every time it is lost.
Is this expected behaviour? We keep a count of how many runs were triggered and how many completed, and receiving the MODIFIED 'Succeeded' event again throws off our count. Is there something wrong in the implementation that can be corrected so that this event is not received twice?
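For reference, here is a minimal sketch of the kind of loop described above, using the C# client's list-and-watch helpers. The namespace name, counting placeholder, and error handling are illustrative only, not the reporter's actual code, and the exact method grouping (e.g. the CoreV1 property) varies by client version.

```csharp
using System;
using k8s;
using k8s.Models;

var client = new Kubernetes(KubernetesClientConfiguration.InClusterConfig());
string resourceVersion = null;

while (true)
{
    try
    {
        // list + watch pods in one call; resourceVersion tells the server
        // where to resume from after a reconnect
        var listTask = client.CoreV1.ListNamespacedPodWithHttpMessagesAsync(
            "my-namespace", resourceVersion: resourceVersion, watch: true);

        await foreach (var (type, pod) in listTask.WatchAsync<V1Pod, V1PodList>())
        {
            resourceVersion = pod.Metadata.ResourceVersion;
            Console.WriteLine($"{type} {pod.Metadata.Name} phase={pod.Status?.Phase}");
            // run-triggered / run-completed counters would be updated here
        }
    }
    catch (Exception e)
    {
        // connection dropped (e.g. idle timeout): log and re-establish the watch
        Console.WriteLine($"watch dropped: {e.Message}");
    }
}
```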
-
1. Most likely your resourceVersion was garbage collected on the server (it became too old). That is why restarting fixed the issue: the version variable was reset.
2. Are those MODIFIED events just related to a pod state change, for example terminating?
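On the second point, one way to keep the count stable even if a MODIFIED event for an already-Succeeded pod is redelivered is to track the last phase seen per pod UID and count only the transition. This is an illustrative handler, assuming the count is keyed off the pod phase; it is not from the thread.

```csharp
using System.Collections.Concurrent;
using k8s;
using k8s.Models;

var lastPhase = new ConcurrentDictionary<string, string>();
var completed = 0;

// called from the watch loop for every event it receives
void OnPodEvent(WatchEventType type, V1Pod pod)
{
    var uid = pod.Metadata.Uid;

    if (type == WatchEventType.Deleted)
    {
        lastPhase.TryRemove(uid, out _);   // forget pods that are gone
        return;
    }

    var phase = pod.Status?.Phase ?? "Unknown";
    lastPhase.TryGetValue(uid, out var previous);

    // count only the transition into Succeeded, so a redelivered MODIFIED
    // event for an already-Succeeded pod does not double-count
    if (phase == "Succeeded" && previous != "Succeeded")
    {
        completed++;
    }

    lastPhase[uid] = phase;
}
```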
-
So will we have to catch the 410 explicitly, i.e. will it not be caught by the generic Exception block? If it is caught, it should enter the loop again, right?
-
I mean your
-
I see. Let me make that change and observe. Thanks for your reply!
-
@tg123, do you think it's better to always initialize the list with resourceVersion null, or should I set the resource version to null only if we receive a 410?
Or should I catch the exception and set the resource version to null, as below -
How would the behavior differ in these two scenarios? The one difference I understood is that if I initialize the list without a resource version and then watch on it, I will receive the ADDED event (the one normally sent when a pod is created) for all the existing pods in the namespace, regardless of their current status. Is this understanding correct? Will there be any other implication?
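For illustration, the "reset only on 410" variant might look roughly like the sketch below. The exception types are assumptions about how the expired resource version surfaces (it can arrive either as a watch ERROR event or as an error on the list call itself), and the loop body mirrors the earlier sketch rather than the actual code.

```csharp
try
{
    var listTask = client.CoreV1.ListNamespacedPodWithHttpMessagesAsync(
        "my-namespace", resourceVersion: resourceVersion, watch: true);

    await foreach (var (type, pod) in listTask.WatchAsync<V1Pod, V1PodList>())
    {
        resourceVersion = pod.Metadata.ResourceVersion;
        // handle the event...
    }
}
catch (KubernetesException e) when (e.Status?.Code == 410)
{
    // the stored resource version has expired: fall back to a fresh list+watch
    resourceVersion = null;
}
catch (Exception e)
{
    // any other failure: keep the last resource version and re-establish the watch
    Console.WriteLine($"watch dropped: {e.Message}");
}
```

Starting with resourceVersion null, by contrast, always begins from a fresh list, which is why ADDED events are replayed for every existing pod in the namespace.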
-
You can update
-
@tg123, I made the recommended update as below -
But the same thing happened again on Sunday, March 19 at 3:10 pm UTC. After the update received at 3:10 pm, the watcher stopped receiving any further events and no exception was logged. How can I figure out what went wrong? Is there a way to force this method to restart every 30 minutes?
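One server-side option, sketched below under the same assumptions as the earlier snippets, is the timeoutSeconds parameter of the list/watch call: the API server closes the watch after the given interval, so the outer loop re-establishes it on a fixed cadence instead of hanging silently. A client-side CancellationTokenSource with a 30-minute timeout would be an alternative.

```csharp
while (true)
{
    try
    {
        var listTask = client.CoreV1.ListNamespacedPodWithHttpMessagesAsync(
            "my-namespace",
            resourceVersion: resourceVersion,
            timeoutSeconds: 1800,   // ask the server to close the watch after 30 minutes
            watch: true);

        await foreach (var (type, pod) in listTask.WatchAsync<V1Pod, V1PodList>())
        {
            resourceVersion = pod.Metadata.ResourceVersion;
            // handle the event...
        }

        // normal end of the 30-minute window: loop around and watch again
    }
    catch (Exception e)
    {
        Console.WriteLine($"watch dropped: {e.Message}");
    }
}
```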
-
The issue was resolved by removing ConfigureAwait(false) in the startup during initialization.
The above code remains the same.
In startup, we earlier had
service.WatchPodsAsync().ConfigureAwait(false);
and this was changed to
service.WatchPodsAsync()
The watcher is no longer stopping.
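For context, a common way to own such a long-running loop in an ASP.NET Core app is a hosted BackgroundService, so the task is awaited by the host and an unexpected exit is at least logged. The sketch below is an assumed arrangement, not the reporter's actual setup; IPodWatchService and WatchPodsAsync stand in for the service referenced above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public interface IPodWatchService
{
    Task WatchPodsAsync(); // assumed shape of the method called from startup above
}

public class PodWatcherHostedService : BackgroundService
{
    private readonly IPodWatchService _service;
    private readonly ILogger<PodWatcherHostedService> _logger;

    public PodWatcherHostedService(IPodWatchService service,
                                   ILogger<PodWatcherHostedService> logger)
    {
        _service = service;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            // the host owns and awaits this task, so a silent stop becomes visible
            await _service.WatchPodsAsync();
        }
        catch (Exception e)
        {
            _logger.LogError(e, "Pod watcher terminated unexpectedly");
            throw;
        }
    }
}
```

Registered with services.AddHostedService<PodWatcherHostedService>(); in startup, this replaces the fire-and-forget call shown above.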