Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers. #2447

Closed
ManuelHaas opened this issue Sep 26, 2024 · 2 comments

Comments

@ManuelHaas
Copy link

Hi,

I got the following entries in my log:

2024-09-26 02:26:04.697 +02:00 [ERR] Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers.
2024-09-26 02:31:07.708 +02:00 [ERR] Actual wait time for non-canceled token was '00:00:00' instead of '00:00:01', wait result: False, using protective wait. Please report this to Hangfire developers.

I am reporting it here because that is how it is written in the message.

.NET 8.0.8
Hangfire 1.8.14

odinserj added a commit that referenced this issue Sep 30, 2024
@odinserj
Copy link
Member

Thank you so much for reporting this! And looks like I need help in validating my logic below, because my results don't make any sense.

Several years ago, in the early days of .NET Core, I began to suspect there are problems with waiting on CancellationToken.WaitHandle and implemented a workaround with a custom WaitHandle object in the commit 140b92f. Some month ago I removed that logic, thinking there should be no problems with that property anymore, since .NET Core is now much more stable, but decided to keep some reinsurance logic that will help to diagnose this case without causing CPU to run at 100% due to constant retries.

The following piece of code performs waiting on a cancellation logic between some retry attempts, and calls the WaitHandle.WaitOne method with a timeout value on a CancellationToken to do so with the possibility to prevent delays during shutdown. All the other logic is to ensure that WaitOne is working fine.

var stopwatch = Stopwatch.StartNew();
var waitResult = cancellationToken.WaitHandle.WaitOne(timeout); // The main line
stopwatch.Stop();

var timeoutThreshold = TimeSpan.FromMilliseconds(1000);
var elapsedThreshold = TimeSpan.FromMilliseconds(500);
var protectionTime = TimeSpan.FromSeconds(1);

if (!cancellationToken.IsCancellationRequested && // Checking `cancellationToken` hasn't been canceled
    timeout >= timeoutThreshold && // Checking `timeout` more than 1 second
    stopwatch.Elapsed < elapsedThreshold) // Checking less that 500ms elapsed
{
    try
    {
        var logger = LogProvider.GetLogger(typeof(CancellationTokenExtentions));
        logger.Error($"Actual wait time for non-canceled token was '{stopwatch.Elapsed}' instead of '{timeout}', wait result: {waitResult}, using protective wait. Please report this to Hangfire developers.");
    }
    finally
    {
        Thread.Sleep(protectionTime);
    }
}
  1. From the messages above we see that the result value of WaitOne is false, and that means that CancellationToken instance wasn't canceled, e.g. the corresponding WaitHandle instance hasn't been signaled (and IsCancellationRequested check confirms this).
  2. WaitOne was called with at least 1-second timeout value, according to the log. It is a positive value, not TimeSpan.Zero, so according to MSDN and common sense, the calling thread should be blocked until the timeout occurs.
  3. But the message says that stopwatch.Elapsed was less than 500ms.

So it's either my logic is wrong (probably), detection logic is wrong (probably) my understanding of WaitOne method's behavior is wrong (probably), or WaitHandle.WaitOne method is wrong that's totally improbable – this method has so many usages and expected to be battle tested everywhere.

I would appreciate any help in understanding what's wrong here and in more reports of this behavior.

In version 1.8.15 I will improve the precision to know the exact timings in milliseconds, since TimeSpan.ToString precision is not enough here.

@ManuelHaas
Copy link
Author

As far as I can tell, the error only occurs when the server is under heavy load or the host system of the virtual machine is causing problems.
A little over two weeks ago, the host system of the virtual machine started to cause problems.
On the instance, the error message occurred multiple times every day for 12 days. Three days ago, I moved the instance to a new server and the error message has not occurred since.

@odinserj odinserj added this to the Hangfire 1.8.16 milestone Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants