Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warning once message to detect rare circumstance of negative main delta #31900

Closed
wants to merge 1 commit into from

Conversation

lawnjelly
Copy link
Member

@lawnjelly lawnjelly commented Sep 2, 2019

On some hardware / OSes calls to the timing APIs may occasionally give negative deltas in the main loop. This can cause bugs where time is assumed to always goes forward. This PR simply adds a warning once message if this situation is detected. No control flow is changed.

Related to #31837, #25166, #26887. Mentioned in #31016.

core/engine.cpp Outdated
{
// prevent clock going backwards due to OS timer bugs
if (os_time > _frame_ticks)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yep sure I'll fix these with the beautifier plugin, it is just WIP for now as it needs thorough testing. 👍

@lawnjelly lawnjelly force-pushed the timing2 branch 2 times, most recently from 5b38477 to d58724a Compare September 2, 2019 16:20
@lawnjelly lawnjelly requested a review from hpvb as a code owner September 2, 2019 16:20
@lawnjelly lawnjelly force-pushed the timing2 branch 2 times, most recently from 5b54f9d to d477a53 Compare September 2, 2019 17:34
@lawnjelly lawnjelly changed the title [WIP] Prevent main iteration timer having negative deltas Prevent main iteration timer having negative deltas Sep 3, 2019
@lawnjelly
Copy link
Member Author

The problem

Timing API functions on some OSes and hardware and situations do not give the expected behaviour that measured time always increases.

See:
https://forum.libcinder.org/topic/getelapsedseconds-sometimes-goes-backward-on-windows
https://forum.kodi.tv/showthread.php?tid=43274
etc

This can cause problems where Godot code assumes increasing time, especially where unsigned math is used to find the delta, e.g.:

uint64_t delta = (uint64_t) time_curr - (uint64_t) time_prev;

Where a negative delta cannot be represented and appears as a huge positive delta. Also passing negative deltas to _process is not expected.

The fix

In the main::iteration after OS::get_ticks_usec() is called, the value is sanitised to remove negative deltas.

The logic is basically:

 If delta > 0:
    // use delta as is
else:
   delta = 1; // or zero, but puthre points out that some code may not expect a zero delta

Resyncing

A slight alternative logic is that if the time measured is in the past, the next delta is measured relative to that 'erroneous' value:

uint64_t running_total = 0;
uint64_t time_prev = 0;

void update(uint64_t measured_time)
{
	if (measured_time > time_prev)
	{
		// if time is moving forward
		running_total += (measured_time - time_prev);
	}
	else
	{
		// if time is moving backward
		running_total += 1;
	}
	
	// resync and base the next delta on the previously measured time
	time_prev = measured_time;
}

As far as I can see this resyncing is undesirable unless needed to workaround OS bugs. It has the potential to make our timing jitter more than is necessary, but would be easy to swap if there are problems with the simpler approach.

Alternative approaches

I first tried to address this problem by modifying the OS::get_ticks_usec() function in #31863. However that approach proved problematic, because this function is used liberally throughout Godot and exhibited a high frequency of backward timing errors on all hardware / OS tested. See that PR for details.

What do other people do?

The literature suggests that the most reliable solution may be to reserve a CPU core for the timing thread (presumably the main thread in our case), and to sanitise the values. We could consider the former as well in the future if it proves necessary.

Notes

This backward time bug potentially has implications in other places throughout Godot (including within user scripts).

I am now of the impression that the most robust solution to these problems is for most code in need of a timer to utilize _frame_ticks in Engine (which is set during this PR), i.e. use one sanitised frame timer rather than make several independent calls to the underlying OS which will be subject to the bug.

This is not addressed in this PR but is something that can be discussed and potentially changed in other code in another PR.

Note also that this is a particularly sinister bug because it may not exhibit in all situations, or on all hardware (so is difficult for developers to debug), and to future proof us from future issues due to this, it is worth coming to some kind of robust solution rather than simply consider spurious crashes acceptable further down the line.

@lawnjelly lawnjelly changed the title Prevent main iteration timer having negative deltas [WIP] Prevent main iteration timer having negative deltas Sep 3, 2019
@Faless
Copy link
Collaborator

Faless commented Sep 3, 2019

Given a cycle like:

var start = OS.get_ticks_usec()
while OS.get_ticks_usec() - start < 10:
   print("Some ugly busy wait")

Wouldn't that be an endless loop? I know this is weird, but also quite counter-intuitive given the function name.

Now to the actual issue, I was wondering if it's a threading problem in the end.
OS time functions are not thread safe, we don't protect them with a mutex, and given that it might be called from different threads that might be the issue.

@lawnjelly
Copy link
Member Author

Given a cycle like:

var start = OS.get_ticks_usec()
while OS.get_ticks_usec() - start < 10:
   print("Some ugly busy wait")

Wouldn't that be an endless loop? I know this is weird, but also quite counter-intuitive given the function name.

Yes, absolutely if you mean the suggestion of trying to move as much as possible to using a once off frame timer, so the name of such a function would probably have to be different, e.g. get_frametime_usec(). But that is really a discussion for another PR if it proves necessary.

Now to the actual issue, I was wondering if it's a threading problem in the end.
OS time functions are not thread safe, we don't protect them with a mutex, and given that it might be called from different threads that might be the issue.

I did see your comment on the irc logs, and tried to check this out via google and you may well be right on the OS time functions needing a mutex. However the small test I did with puthre, I did wrap with a mutex and the issue still occurred. This was not rigorous though, so I'll try and do some more thorough testing when I get back to ensure it isn't just something a lack of mutex is causing (I am away from home for a few days, and have marked this as WIP). Luckily the issue is reproducible on my kaby lake test machine.

The literature on this issue of the hardware timers seems to be vast and goes back years and is a veritable can of worms with different hardware using different timers (and it seems you can switch between them on linux) - HPET, acpi_pm, tsc, constant_tsc, nonstop_tsc etc, with much argument among hardware guys amongst themselves about whether such errors are possible (!!) .

My view is that given our timing code currently cannot tolerate backwards timing, we should probably put in the logic to prevent this situation, even if we are not sure whether it should really occur in the wild. A couple of lines of error prevention with no cost to us seems like a no brainer to close the main issue (much as we'd think nothing of having e.g. an assert in the code).

@Faless
Copy link
Collaborator

Faless commented Sep 4, 2019

My view is that given our timing code currently cannot tolerate backwards timing, we should probably put in the logic to prevent this situation, even if we are not sure whether it should really occur in the wild. A couple of lines of error prevention with no cost to us seems like a no brainer to close the main issue (much as we'd think nothing of having e.g. an assert in the code).

I disagree on this point, if we can be sure that the error cannot happen we should not add useless code. What's being suggested here is not "A couple of lines of error prevention with no cost to us" but a radical change to how the function work. Sadly, I haven't been able to reproduce this error so it's hard for me to test.

Can you send me the project you are using to test this?
Can you try this small patch? (I'll have to double check if using a global lock is fine here, but at least we can test if mutexes works in an easy way)

diff --git a/drivers/unix/os_unix.cpp b/drivers/unix/os_unix.cpp
index ab5590dba4..fea209a20f 100644
--- a/drivers/unix/os_unix.cpp
+++ b/drivers/unix/os_unix.cpp
@@ -260,6 +260,7 @@ void OS_Unix::delay_usec(uint32_t p_usec) const {
 }
 uint64_t OS_Unix::get_ticks_usec() const {
 
+       GLOBAL_LOCK_FUNCTION
 #if defined(__APPLE__)
        uint64_t longtime = mach_absolute_time() * _clock_scale;
 #else
diff --git a/platform/windows/os_windows.cpp b/platform/windows/os_windows.cpp
index be325381bb..32c6b0e37e 100644
--- a/platform/windows/os_windows.cpp
+++ b/platform/windows/os_windows.cpp
@@ -2286,6 +2286,8 @@ void OS_Windows::delay_usec(uint32_t p_usec) const {
 }
 uint64_t OS_Windows::get_ticks_usec() const {
 
+       GLOBAL_LOCK_FUNCTION
+
        uint64_t ticks;
        uint64_t time;
        // This is the number of clock ticks since start

@ghost
Copy link

ghost commented Sep 5, 2019

Sadly, I haven't been able to reproduce this error so it's hard for me to test.

The dreadful part of it is that it seems many of the causes might be from specific hardware circumstances. I haven't had negative deltas in my years using Godot on my systems, but it is a source of anxiety to think about making a game for 1-2 years and it simply doesn't work for some % of people because of an unaddressed quirk like this.

That said, I'm glad you've taken an interest in the issue.

@lawnjelly
Copy link
Member Author

I haven't had negative deltas in my years using Godot on my systems, but it is a source of anxiety to think about making a game for 1-2 years and it simply doesn't work for some % of people because of an unaddressed quirk like this.

Realistically with multiplatform software, particularly things like games that use advanced graphics, there will nearly always be a percentage of people who it doesn't work for. On the plus side with an open source engine such as Godot which is used for lots of games, there are lots of people who can report issues and lead to them being fixed, much more so than with a bespoke engine. And something like this can usually be fixed by simply re-exporting your game with the latest version of the engine.

Also don't panic, the main negative delta issue was probably solved in an earlier PR changing to MONOTONIC / MONOTONIC_RAW clocks which aren't meant to be subject to backwards jumps due to NTP / adjusting clock / date (whereas other clocks can be).

I'm still stuck away from home, with motorcycle in the garage with a broken clutch cable, so until that's done I can barely access internet by tethering my phone to an old laptop, let alone run godot etc, even writing a post is difficult so I apologize if this is a bit rambling. I'm eager to do more testing, it would be nice if the more common occurrence is just a race condition in our code (or even the debugging code!), because that would be easy to fix. However I will warn that in some cases I don't think it is, the android phone we tested the other day #31837 was producing crazy timings (which prompted me to investigate whether this was a general issue worth dealing with).

I have been able to do a little reading into the hardware side. Caveat I am really not a hardware or OS guy and have a very limited understand of what goes on (and treat it as a black box usually), and may have misunderstood some of the following, so please don't rely on as gospel, it would be nice to have some input from an OS guy:

What seems to be a source of confusion is that docs are very reassuring when they say things like 'the monotonic clock only goes forward, by definition'. What they don't tell you is that the underlying mechanism for the clocks is actually a pretty complex (read error prone, if past bug fixes are anything to go by) beast, and has 2 parts, only part of which is the main highly tested operating system kernel (which does some things to the values that come from the hardware to e.g. synchronize with NTP time by slewing the values in the case of monotonic clock (it should not in theory jump, which can happen with the realtime clock for instance)), and also interpolates values actually coming from the hardware (see the coarse timers).

What isn't clear in the docs is that the other half of the equation is the figures coming in from the hardware, and the device drivers, which is the board support package (bsp) which may be customized for a particular combination of hardware.

https://www.quora.com/What-is-the-difference-between-BSP-and-Linux-Kernel

As I see it the kernel is a layer but it depends on a lower layer to talk to the hardware for certain basic tasks like provide the timers. So even if there are no bugs in the main linux kernel, there can either be bugs in the hardware, or bugs in the BSP that provides the timing values to the kernel, especially with rarer chipsets. The bsp for instance could have race conditions when accessing the timer.

The hardware also might have available several different clock sources and the OS can choose between them (you can force change the clock source in some cases). Some clock sources may have bugs and others may not.

To see what clock sources you have on your system:
cat /sys/devices/system/clocksource/clocksource0/available_clocksource

To see which one is being used:
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Anyway here are some links for those interested:
https://opensource.com/article/17/6/timekeeping-linux-vms
https://www.mail-archive.com/[email protected]/msg10999.html
https://lists.rtems.org/pipermail/users/2014-November/028362.html
https://lkml.org/lkml/2017/5/30/991
https://news.ycombinator.com/item?id=4130691
https://stackoverflow.com/questions/3657289/linux-clock-gettimeclock-monotonic-strange-non-monotonic-behavior
https://kernel.googlesource.com/pub/scm/linux/kernel/git/horms/renesas-bsp/+/rcar-3.0.1/Documentation/timers/timekeeping.txt
https://www.tomshardware.com/reviews/amd-ryzen-clock-bug-benchmark-scores,6312.html
https://stackoverflow.com/questions/27419223/time-running-backwards-with-queryperformancecounter
http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/

@lawnjelly
Copy link
Member Author

lawnjelly commented Sep 7, 2019

Finally, I've got back home and be able to test this properly. It turns out that @Faless is absolutely right (at least on my system), the problem seems to be stemming from the OS timing functions needing a mutex.

In a classic example of not having enough time to rigorously test (as I only had 5 mins before I had to leave earlier in the week and was running late) it turned out when I quickly added a mutex to test puthre's idea (that the debugging previous value needed a mutex), the mutex itself wasn't getting created properly in the OS constructor, looks like due to an order of construction problem (I'm not familiar with godot's mutexes etc), and an error was flagging in the logs but I hadn't seen it.

That mutex was meant to also be around the OS timing functions (so by accident testing them too), however after trying Faless' GLOBAL_LOCK_FUNCTION macro, this did the trick, and repeatedly solves the issue on mine. I did check out puthre's suggestion and it doesn't seem to be the debugging code giving a false error, the OS timing funcs themselves need a mutex, because if the debugging code is given the mutex and the timing funcs not, then the problem still occurs.

So after this rather wild goose chase on my part, I am glad something good has come out of it, there does actually seem to be a problem. 😄

As to fixing it, Faless suggestion seems to be bang on. I'll change this PR for now, or Faless you could create a new one? If GLOBAL_LOCK_FUNCTION is too much of a sledgehammer, I could do with some guidance how best to use e.g. godot mutex as I'm not familiar with what synchronization primitives are available in godot and how best to use them (e.g. when mutex becomes available to use, but I can probably figure this out).

Another gotcha to mention is that given there might be an order of construction issue with the mutex, the possibility of the timing code being called BEFORE the mutex is active. Probably not a problem, as it is the main loop delta that is the biggest consideration, as other calls are likely to be cosmetic, but just something for us to check.

I would now be interested to check again #31837 with this fix. Once we have a reasonable implementation I'll see if @xuvatilavv is available again for some testing with his device (it's a bit more involved because the android templates need compiling etc).

More Info

Occurs in all godot projects (just create a new project and add a spatial and run)

Test patch to show backward timings on systems where it happens:
https://github.com/lawnjelly/godot/tree/timing-test

Note this test is only for non-windows OS. It should debug print if there are backwards timings, and if you uncomment:
#define TIMING_MUTEX_ALL
in OS_Unix::get_ticks_usec
this will wrap the timing call in the mutex, which cures the issue on my system.

It isn't as compact as I'd like and doesn't delete the mutex, but I had to change the timings calls from const to non-const to store the previous values.

… loop delta

A simple warning for erroneous backward timing that may occur on some hardware. This will give some idea of whether this is a widespread issue, and allow easier diagnosis of timing problems should they occur on the issue tracker.
@lawnjelly lawnjelly changed the title [WIP] Prevent main iteration timer having negative deltas [WIP] Warning once message to detect rare circumstance of negative main delta Sep 7, 2019
@Faless
Copy link
Collaborator

Faless commented Sep 7, 2019

I did some research, clock_gettime seems to be thread safe according to linux man:
http://man7.org/linux/man-pages/man3/clock_gettime.3.html#ATTRIBUTES

Very weird...

@lawnjelly
Copy link
Member Author

Ok, my current thinking is that despite confirming that backward timing is a potential issue, I'm hoping it shouldn't currently affect the main loop, as it is running from a single thread and the delta is quite large relative to the discrepancies (except in the case of more severe hardware / kernel bugs). For this reason I've reverted this PR to just print a warning once if it detects backward timing, which I think might be useful to have for diagnostic reasons.

I have also prepared a PR that wraps the calls to the timing APIs in a mutex (following Faless suggestion), which seems to remove the issue on my main system (Intel core 55-7500T CPU, Linux mint 18.2 64 bit), however unless Faless or any others think we should put that in, I am happy to holding off on that and keep on watch for issues caused by it.

The backward timing values are small (mostly 200 or less usecs). On top of that even if we place a mutex around the actual API call, there can still be a race conditions I think:

e.g.

Thread A -> What is the time? 10pm
Thread A-> Do some calculations
Thread B-> What is the time? 11pm
Thread B-> Do some calculations
Thread B-> Store result
Thread A-> Store result

Function A could still end up finishing after function B despite starting earlier.

So I am not convinced there is a totally generic way of solving the phenomenon, perhaps we just have to write code to deal with this possibility. And moving as much to using a single frame timer would solve much of this anyway.

@lawnjelly
Copy link
Member Author

I did some research, clock_gettime seems to be thread safe according to linux man:
http://man7.org/linux/man-pages/man3/clock_gettime.3.html#ATTRIBUTES

Very weird...

Yes I'd read some sources saying it should be multi-thread safe (and others not sure). It may be down to the very nature of time, by the time you've read a value it has changed. A 'heisenbug'! 😄

https://stackoverflow.com/questions/570353/hardest-types-of-bugs-to-track

Timing in general tends to be a difficult topic to understand (probably because as humans we have a tendency to think in realtime, whereas games simulations are really 'pre-rendered' but just at high speed trying to keep up, and you need a different mindset), and this issue has definitely been driving me loopy!

@lawnjelly lawnjelly changed the title [WIP] Warning once message to detect rare circumstance of negative main delta Warning once message to detect rare circumstance of negative main delta Sep 8, 2019
@lawnjelly lawnjelly closed this Sep 27, 2019
@lawnjelly lawnjelly deleted the timing2 branch October 2, 2019 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants