Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevents OS::get_ticks_usec() from going backwards in time due to OS … #31863

Closed
wants to merge 1 commit into from

Conversation

lawnjelly
Copy link
Member

@lawnjelly lawnjelly commented Sep 1, 2019

…bugs.

On some hardware / OSes calls to timing APIs may occasionally falsely give the impression time is running backwards (e.g. QueryPerformanceCounter). This can cause bugs in any Godot code that assumes time always goes forwards. This PR simply records the previous time returned by the OS, and returns the previous time if the new time is earlier than the previous time.

May fix #31837, #25166, #26887. Mentioned in #31016.

It does this by changing the original OS function names from get_ticks_usec() to get_ticks_raw_usec() (and making them protected to emphasise not to call them) and making the get_ticks_usec() function into a wrapper which enforces the logic.

This way none of the other Godot code needs to be changed, and it can assume that deltas will always be positive, rather than having each bit of code deal with this possibility.

@lawnjelly lawnjelly requested review from hpvb and vnen as code owners September 1, 2019 10:55
@lawnjelly
Copy link
Member Author

lawnjelly commented Sep 1, 2019

Note that you can put a WARN_PRINT to show when time is detected going backwards:

uint64_t OS::get_ticks_usec() {
	// This is a workaround for rare OS bugs where a call to get_ticks_raw_usec()
	// may return an EARLIER value than that given in a previous call.
	// This ensures that the minimum delta will be zero, and prevents bugs which rely on
	// the delta being positive.
	uint64_t t = get_ticks_raw_usec();
	if (t >= _ticks_usec_prev)
		_ticks_usec_prev = t;
	else {
		WARN_PRINT("OS time running backward detected");
	}

	return _ticks_usec_prev;
}

It actually occurs reasonably frequently on my Linux Mint, usually when switching between tasks rather than during gameplay. This may be affecting the CPU core that the timer is queried on, which has been cited as one of the causes of this kind of problem.

See:
https://forum.libcinder.org/topic/getelapsedseconds-sometimes-goes-backward-on-windows
https://forum.kodi.tv/showthread.php?tid=43274
etc

This is causing problems further down the line, particularly where code is using unsigned math to compare between current and previous timing values, where a negative delta is interpreted as a huge positive difference.

According to some articles, it may be possible to help alleviate this on some OSes by trying to reserve a particular core, however this may be unnecessary for Godot if it only occurs on things like task switching, which shouldn't affect gameplay.

Note also that it is also possible to use a slightly more complex logic, whereby the godot clock 'resyncs' when it detects time going backwards, however hopefully that should not be necessary.

@Chaosus Chaosus added this to the 3.2 milestone Sep 1, 2019
@YeldhamDev
Copy link
Member

These changes should be explained in their method descriptions in the docs.

@lawnjelly
Copy link
Member Author

These changes should be explained in their method descriptions in the docs.

Ah good point I'll try and check this later. 👍 Shouldn't be anything major, get_ticks_usec() and get_ticks_msec() become non-const. Other than that the functionality is the same, is just a bug fix.

Hopefully the internal get_ticks_raw_usec() and _ticks_usec_prev can be hidden from the docs (not sure how doctool works yet, will have to test).

@xcrdev
Copy link

xcrdev commented Sep 1, 2019

Do the OSes not have APIs to get monotonic time?

@lawnjelly
Copy link
Member Author

Do the OSes not have APIs to get monotonic time?

Yes, however, afaik since things like multicore CPUs and speedstep, old assumptions about timing no longer hold, and these functions aren't guaranteed to give you accurate results. Just google for 'QueryPerformanceCounter + backwards' for more articles like those linked for more info.

@lawnjelly
Copy link
Member Author

Just a note to hold on merging this while I do some testing on #31837, it may need the resync modification I mentioned earlier.

@ghost
Copy link

ghost commented Sep 2, 2019

@lawnjelly Interesting. Thanks for sharing the details behind this problem. There is quite a can of worms behind it.

Reading some articles out there, the recommendations seem to be clamping like what is done in this PR, and then having the counter operate on a thread that has a processor affinity. To handle situations when the issue is related to multi-core hiccups.

I still know very little about Godot's backend, but if part of the issue is the result of multi-threading on certain hardware/OSes, shouldn't this also include making the tick query execute in a thread that has been assigned to a single processor? Does the engine's main thread have a processor affinity?

…bugs.

On some hardware / OSes calls to timing APIs may occasionally falsely give the impression time is running backwards (e.g. QueryPerformanceCounter). This can cause bugs in any Godot code that assumes time always goes forwards. This PR simply records the previous time returned by the OS, and returns the previous time if the new time is earlier than the previous time.

May fix godotengine#31837, godotengine#25166, godotengine#27887.
@puthre
Copy link
Contributor

puthre commented Sep 2, 2019

I don't think you should allow a diff of 0 since the previous call as users expect time to pass between calls and might try to divide by this difference resulting in very hard to reproduce bugs.

@lawnjelly
Copy link
Member Author

I don't think you should allow a diff of 0 since the previous call as users expect time to pass between calls and might try to divide by this difference resulting in very hard to reproduce bugs.

Good point, maybe we can have a minimum value. Do you reckon 1 usec would be okay, considering it might be converted to float .. and possible precision errors?

@lawnjelly
Copy link
Member Author

I still know very little about Godot's backend, but if part of the issue is the result of multi-threading on certain hardware/OSes, shouldn't this also include making the tick query execute in a thread that has been assigned to a single processor? Does the engine's main thread have a processor affinity?

I'd read this too, but this is well outside my small realms of knowledge. This may well be something worth experimenting with if we don't do it already (bearing in mind there may be performance consequences of reserving a core if Godot is running in the background). However it is unclear to me whether this may be an issue on some hardware but not others, depending how it does timing.

This thread for instance suggests that the linux kernel now updates some values in userspace regularly that are cheap to read, which I don't know how that ties in with the idea of swapping cores affecting the value read:
https://discourse.libsdl.org/t/android-timer-precision-increase-1000x/21376/6

I think we need a more hardware familiar guy to help us make decisions on that front.

In my mind though the first step is to make sure we have as robust a timer as possible, that can cope with violations of assumptions (such as time always going forward), and that is something we can do.

@lawnjelly
Copy link
Member Author

I still know very little about Godot's backend, but if part of the issue is the result of multi-threading on certain hardware/OSes, shouldn't this also include making the tick query execute in a thread that has been assigned to a single processor? Does the engine's main thread have a processor affinity?

Am now suspecting this is an issue, certainly with the current fix here, and I will close this PR and come up with an alternative way of doing this.

Currently get_ticks_usec() may be called from multiple threads from Godot (don't know that it is, but it could be), which may lead to time shifts each time if these are on different cores. The main bug scenario concerns the call from main::iteration which drives the main delta. So rejigging the synchronization via a wrapper in other threads could mess with this.

To start with I will try and solve the main::iteration backwards timing issue, probably by making a central single call to the OS API timing function once per frame, from the main thread, and ensuring this is synchronized as best we can.

It is also probably best if in as many circumstances as possible throughout Godot and plugins we don't call the OS function, we instead use our locally stored, per frame timer. Thinking about it, this is what I normally see in games.. typically you want access to 2 times:

  1. Frame time
  2. Logical physics time

In Godot however a file in files for get_ticks_usec shows it being used from approx 30 different files. Maybe we need to change our timing paradigm slightly to properly deal with this problem, and use indirect queries to our own stored timing values.

Other uses for timers are for profiling, in which case the vagueness of the current situation might be okay. I am not sure about the other cases, and how crucial it is on a case by case basis to deal with these issues.

@ghost
Copy link

ghost commented Sep 2, 2019

I can only imagine, as the other issues about delta suggest, that the processing loops would be impacted by this. So without a reliable delta across all systems, there would be systems where games would be unplayable. Everything could potentially break if a player's system bumps into these conditions...

Sounds like it could be a very critical problem. 😟

@lawnjelly
Copy link
Member Author

I can only imagine, as the other issues about delta suggest, that the processing loops would be impacted by this. So without a reliable delta across all systems, there would be systems where games would be unplayable. Everything could potentially break if a player's system bumps into these conditions...

Sounds like it could be a very critical problem.

Yes absolutely but it should be relatively easy for us to fix. The main problem existing now is the timing paradigm used within Godot, time is measured via the OS API in multiple places throughout godot, in scripts, and presumably in plugins etc.

This should be changed imo. 😐 In most cases what is needed is not 'the time at which that particular instruction runs', but a global frame time, which already exists in Engine::_frame_ticks (but is not really used). We need an explicit split between 2 functions:

  1. get_frame_time
  2. get_some_wildly_inaccurate_guess of what the CPU time is right now while calling this instruction

If we were feeling cheeky we could simply change the get_ticks_usec function to instead return the frame time, and change this in the docs, then have a separate function to query the current time. But there is the possibility this might be compatibility breaking in some cases.

Or perhaps more likely we could have Engine::_frame_ticks accessible from script etc and give it a more sexy name, and encourage users to use this rather than the old function (perhaps emphasize this in the docs, or make the function deprecated and add a one off warning message). It's one of those things that is easy to change in the godot source code but more difficult to change habits in users (people writing games and modules etc).

@Faless
Copy link
Collaborator

Faless commented Oct 3, 2019

To anyone that can reproduce the original issue
can you try this branch?
https://github.com/Faless/godot/tree/spike/clock_info
Is it still happening there?

The patch just tries to avoid some unnecessary math, limiting potential overflows and hopefully minimizing numerical errors:
master...Faless:spike/clock_info

(there is also a 3.1 version of this patch: https://github.com/Faless/godot/tree/spike/clock_info_3.1 )

@lawnjelly
Copy link
Member Author

I will try but it will have to be next week. Although I don't think the effects were due to numerical error converting to usecs.

I gave up the investigation in the end, but to summarise:

  1. In Inaccurate timing behavior on LineageOS #31837 when we looked at it, the numbers were WAY off, which led me to believe it was actually a bug in that OS / hardware combo, and xuvatilavv mentioned that it wasn't an officially supported combination.
  2. When testing on my own machine, I was getting occasional very small negative timings. However despite wrapping things in mutexes etc, I couldn't guarantee you 100% that there wasn't a flaw in my measuring method (the very nature of time makes this quite tricky, by the time you have measured it, the value is out of date etc).
  3. Despite the 'guarantee' from the docs that MONOTONIC_RAW clock should never go backward, the low level implementation details seemed quite error prone on some hardware, so it entirely feasible that this could arise (unless there are specific checks in the kernel to prevent this possibility).
  4. However, aside from performance timers (which aren't critical) the main important call to get time was once per frame, which was far outside the range of being susceptible to the kind of differences I was seeing. So I figured that even though it may be occurring, it was unlikely to be a problem in practice for us.

I suspect that the earlier issues that reported backward timing were due to use of CLOCK_MONOTONIC rather than CLOCK_MONOTONIC_RAW which was subject to NTP adjustment, which could cause this kind of issue. But I think that was fixed in #22424.

The negative timings I was getting could be cured by putting a mutex around the clock_gettime call.

Just found this which is indicative of at least some of the issues that used to exist (whether they have been cured or not now I don't know). Although maybe not the problem on my kabylake CPU as it has constant_tsc, tsc_adjust, rdtscp, tsc_known_freq, nonstop_tsc.

https://linux.die.net/man/3/clock_gettime

NOTE for SMP systems
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.

If the CPUs in an SMP system have different clock sources then there is no way to maintain a correlation between the timer registers since each CPU will run at a slightly different frequency. If that is the case then clock_getcpuclockid(0) will return ENOENT to signify this condition. The two clocks will then only be useful if it can be ensured that a process stays on a certain CPU.

The processors in an SMP system do not start all at exactly the same time and therefore the timer registers are typically running at an offset. Some architectures include code that attempts to limit these offsets on bootup. However, the code cannot guarantee to accurately tune the offsets. Glibc contains no provisions to deal with these offsets (unlike the Linux Kernel). Typically these offsets are small and therefore the effects may be negligible in most cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inaccurate timing behavior on LineageOS
7 participants