Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers. #2447

ManuelHaas · 2024-09-26T07:26:32Z

Hi,

I got the following entries in my log:

2024-09-26 02:26:04.697 +02:00 [ERR] Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers.
2024-09-26 02:31:07.708 +02:00 [ERR] Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers.

I am reporting it here because the message asks me to.

.NET 8.0.8
Hangfire 1.8.14

odinserj · 2024-09-30T09:25:01Z

Thank you so much for reporting this! It looks like I need help validating my logic below, because my results don't make any sense.

Several years ago, in the early days of .NET Core, I began to suspect there were problems with waiting on CancellationToken.WaitHandle and implemented a workaround with a custom WaitHandle object in commit 140b92f. Some months ago I removed that logic, thinking the property should no longer cause problems now that .NET Core is much more stable, but decided to keep some safeguard logic that helps to diagnose this case without driving the CPU to 100% through constant retries.

The following piece of code waits on a cancellation token between retry attempts. It calls the WaitHandle.WaitOne method with a timeout value on the CancellationToken's wait handle, so that the wait can end early during shutdown instead of delaying it. All the other logic is there to verify that WaitOne is working correctly.

var stopwatch = Stopwatch.StartNew();
var waitResult = cancellationToken.WaitHandle.WaitOne(timeout); // The main line
stopwatch.Stop();

var timeoutThreshold = TimeSpan.FromMilliseconds(1000);
var elapsedThreshold = TimeSpan.FromMilliseconds(500);
var protectionTime = TimeSpan.FromSeconds(1);

if (!cancellationToken.IsCancellationRequested && // `cancellationToken` hasn't been canceled
    timeout >= timeoutThreshold && // `timeout` is at least 1 second
    stopwatch.Elapsed < elapsedThreshold) // less than 500 ms actually elapsed
{
    try
    {
        var logger = LogProvider.GetLogger(typeof(CancellationTokenExtentions));
        logger.Error($"Actual wait time for non-canceled token was '{stopwatch.Elapsed}' instead of '{timeout}', wait result: {waitResult}, using protective wait. Please report this to Hangfire developers.");
    }
    finally
    {
        Thread.Sleep(protectionTime);
    }
}
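For anyone who wants to check this outside Hangfire, here is a minimal sketch of a standalone probe. It repeats the same measurement (a Stopwatch around WaitOne on a token that is never canceled) and counts the waits that return early. The `WaitProbe` class and its method names are hypothetical and not part of Hangfire; the thresholds are whatever the caller passes in.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper for reproducing the measurement; not part of Hangfire.
public static class WaitProbe
{
    // Waits on the token's WaitHandle and returns the WaitOne result
    // together with the duration measured by a Stopwatch.
    public static (bool Signaled, TimeSpan Elapsed) Measure(CancellationToken token, TimeSpan timeout)
    {
        var stopwatch = Stopwatch.StartNew();
        var signaled = token.WaitHandle.WaitOne(timeout);
        stopwatch.Stop();
        return (signaled, stopwatch.Elapsed);
    }

    // Repeats the wait on a token that is never canceled and counts how often
    // WaitOne reports a timeout (false) after less than `minimumExpected`.
    public static int CountShortWaits(int iterations, TimeSpan timeout, TimeSpan minimumExpected)
    {
        using var cts = new CancellationTokenSource();
        var shortWaits = 0;

        for (var i = 0; i < iterations; i++)
        {
            var (signaled, elapsed) = Measure(cts.Token, timeout);
            if (!signaled && elapsed < minimumExpected)
            {
                shortWaits++;
                Console.WriteLine($"Short wait #{i}: {elapsed.TotalMilliseconds:F3} ms");
            }
        }

        return shortWaits;
    }
}
```

Calling `WaitProbe.CountShortWaits(100, TimeSpan.FromSeconds(1), TimeSpan.FromMilliseconds(500))` on an affected machine mirrors the thresholds used above; a non-zero result would suggest the behavior is reproducible without Hangfire.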

From the messages above we see that WaitOne returned false, which means that the CancellationToken instance wasn't canceled, i.e. the corresponding WaitHandle instance wasn't signaled (and the IsCancellationRequested check confirms this).
According to the log, WaitOne was called with a timeout of at least 1 second. That is a positive value, not TimeSpan.Zero, so according to MSDN and common sense, the calling thread should have been blocked until the timeout elapsed.
But the message says that stopwatch.Elapsed was less than 500 ms.
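The expectation described here can be written down as a small sketch. WaitOne on a token's WaitHandle should return false once the timeout elapses without cancellation, and true as soon as the token is canceled. The `WaitOneSemantics` class is a hypothetical illustration, not Hangfire code.

```csharp
using System;
using System.Threading;

// Hypothetical illustration of the expected WaitOne outcomes; not Hangfire code.
public static class WaitOneSemantics
{
    // The token is never canceled, so the wait is expected to run for the
    // whole timeout and then return false.
    public static bool WaitWithoutCancellation(TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource();
        return cts.Token.WaitHandle.WaitOne(timeout);
    }

    // The token is canceled after `cancelAfter`. When that is shorter than
    // `timeout`, the handle is signaled first and the wait returns true.
    public static bool WaitWithCancellation(TimeSpan timeout, TimeSpan cancelAfter)
    {
        using var cts = new CancellationTokenSource();
        cts.CancelAfter(cancelAfter);
        return cts.Token.WaitHandle.WaitOne(timeout);
    }
}
```

The log entries correspond to the first case (result false, token not canceled), except that the measured duration was far shorter than the requested timeout.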

So either my logic is wrong (probably), the detection logic is wrong (probably), my understanding of the WaitOne method's behavior is wrong (probably), or the WaitHandle.WaitOne method itself is wrong, which is totally improbable: this method has so many usages and is expected to be battle-tested everywhere.

I would appreciate any help in understanding what's wrong here, as well as more reports of this behavior.

In version 1.8.15 I will improve the precision to show the exact timings in milliseconds, since the precision of TimeSpan.ToString is not enough here.
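As a rough sketch of what a more precise diagnostic message could look like, the helper below formats both durations in milliseconds with three decimal places and adds the Stopwatch resolution details. This is an illustration only: the `WaitDiagnostics` class and the message wording are assumptions, not the change that went into Hangfire.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Hypothetical formatting helper; not the actual Hangfire implementation.
public static class WaitDiagnostics
{
    // Builds a message with millisecond precision and includes the Stopwatch
    // resolution, which may help when judging very small elapsed values.
    public static string Describe(TimeSpan timeout, TimeSpan elapsed, bool waitResult)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "Actual wait time was {0:F3} ms instead of {1:F3} ms, wait result: {2}, " +
            "high-resolution stopwatch: {3}, stopwatch frequency: {4} ticks/s",
            elapsed.TotalMilliseconds,
            timeout.TotalMilliseconds,
            waitResult,
            Stopwatch.IsHighResolution,
            Stopwatch.Frequency);
    }
}
```

Using the invariant culture keeps the decimal separator stable across servers, which makes log entries from different reporters easier to compare.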

ManuelHaas · 2024-10-04T20:51:05Z

As far as I can tell, the error only occurs when the server is under heavy load or when the host system of the virtual machine is causing problems.
A little over two weeks ago, the host system of the virtual machine started to cause problems.
On that instance, the error message occurred multiple times every day for 12 days. Three days ago I moved the instance to a new server, and the error message has not occurred since.

odinserj added a commit that referenced this issue Sep 30, 2024

Improve precision of some diagnostic messages

d9d6748

Relates to #2447

odinserj added help wanted a: core t: problem labels Sep 30, 2024

odinserj added this to the Hangfire 1.8.16 milestone Nov 15, 2024

odinserj closed this as completed in 6316cac Nov 26, 2024
