std/cputicks: high resolution low overhead strictly monotonic cpu counter #411
Is it really necessary to add a new module for this? This is one of the cases where I actually agree that the functionality should be in the standard library, but why do we have to make a new module for every other new function? IMO, the standard library is already bloated with modules that are unnecessary (for a standard library, not in general); multiple modules have exported only a single function for years (but they can't be deprecated, because people use them). This is not meant as a criticism of this particular suggestion, but rather of the attitude of adding a new stdlib module for every function that seems occasionally useful.
I agree with the idea and I hope it can be added to the stdlib.
Both monotimes and times carry huge dependencies with them, whereas std/cputicks carries 0 dependencies. Modules are a cheap abstraction, see benchmarks in nim-lang/Nim#14614; we can always decide later to import/re-export std/cputicks in std/monotimes if needed as an alternative way to access those APIs, but allowing use of std/cputicks without adding a dependency on all those modules makes sense IMO.
Ok, and why should this dictate how modules are organized? Assuming the reason is related to performance, isn't the whole point of the caching and incremental compilation mechanisms to handle large modules with many dependencies? Dedicating an entire module to such a narrow slice of functionality unnecessarily complicates the standard library's documentation and architecture. This functionality should be incorporated into another module, such as std/monotimes.

I'm also unconvinced that this has any substantial benefit over existing timing methods. Elsewhere, there is the assumption that RDTSC is strictly monotonic (a trait other timing methods lack). Is this true? Even on multiprocessor systems? I'd prefer to see explicit assertions from multiple reliable sources on this.
if you fail to see the point of low overhead and high resolution for measuring performance, I'm not sure what will help. See also [1]. TSCs are widely used for this very purpose, see e.g. https://software.intel.com/content/www/us/en/develop/articles/best-timing-function-for-measuring-ipp-api-timing.html, https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf, http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/, or any other resource on this topic. In particular, the generic nim profiling instrumentation I implemented on top of nim-lang/Nim#15827 relies on this functionality and would give much less accurate results with existing timing methods.

On monotonicity across cores, see for example https://stackoverflow.com/questions/65975629/does-rdtscp-increment-monotonically-across-multi-cores or https://stackoverflow.com/questions/10921210/cpu-tsc-fetch-operation-especially-in-multicore-multi-processor-environment
https://gist.github.com/timotheecour/e5d85be09b141b4bf9f877e1c8731025

RDTSC gives 0.00069, i.e. 520X less overhead. This is on docker though; you'll likely get different results running on linux directly. On osx, it's 2.2X less overhead compared to a raw call to the underlying system clock.
yes, if you care about measuring performance, profiling, etc. Any modern CPU will have some TSC available, and readings within the same CPU will still be monotonic.
RDTSCP is newer and adds some overhead compared to RDTSC; depending on the use case, it may or may not be appropriate. A good API will offer both to accommodate different use cases. For a profiling use case with nested calls, overhead is a key concern because the overhead affects timing in callers.
std/cputicks can grow functionality as needed and be imported/re-exported if needed in other modules, whereas moving it into std/monotimes would carry too many dependencies, as shown in #411 (comment)
given the changing nature of hardware support and instructions, this is clearly something that could be implemented in a stand-alone library, with great benefit to the community: it could have a release cycle disentangled from the Nim compiler releases, new hardware support could be added quicker, etc.
Ok, let's start from the top. Please note that these are not all criticisms against using RDTSC. Some of these are criticisms about how RDTSC is proposed to be used, or regarding the arguments in favor of it.

From comment 1:
Micro-benchmarks are nearly (if not always) impossible to write correctly without repetition. In modern SMP systems there is simply too much "noise" (processor scaling, OS paging, background processes) to get a clean result in one execution. This is setting aside the fact that the usefulness of micro-benchmarks is often questionable.
This is the one situation where I think RDTSC is useful. Fine-grained profiling tends to involve lots of timing calls, and unlike benchmarks, there is a need for the program to run as fast as it normally would.
Aside from the previous point, note this rather explicit disclaimer at the top of the linked benchmark:
From comment 4 and comment 8, respectively:
Ok, and why are these dependencies a problem? And is this problem significant enough that it can't be solved through improving Nim's implementation? Keep in mind that whatever problem this is must be weighed against the costs of adding a new standard library module. A new standard library module increases the complexity of reading and learning the standard library's documentation, as well as maintaining the standard library.

From comment 6:
How much less accurate? Is the loss in accuracy substantial enough to have a noticeable, consequential impact?

From comment 8:
General Remarks

Given how the characteristics of high-frequency counter instructions can vary across architectures and operating systems, guaranteeing strict monotonicity across platforms is difficult. That being said, high-frequency counter instructions have their use-cases. They just need to be used judiciously and with caution, where OS-provided functions cannot reasonably be used. They are not a general replacement for said OS-provided functions.

I also do not believe that this should be part of a single module; the functionality provided by this function is small.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents-tps/325462-sdm-vol-1-2abcd-3abcd.pdf
I too fully agree that the addition of new modules should be carefully considered. We already have tons of small modules that declare one or two procs, and while it is hard to argue with "modules are a cheap abstraction" ... from an implementation point of view, I think it is easier for people to discover necessary functionality in already existing modules.
And some others, like typetraits. These modules are added, never expanded (because they were initially created with an extremely narrow scope in mind), and people have to either scan through the full list of modules or hope that someone points out the necessary module (in a forum post, chat discussion, etc.).
The small modules are perfectly fine IMHO, but we should not add what is already available as a Nimble package, like benchmarking frameworks.
Of these, only 2 packages mention rdtsc:
Instead of reinventing the wheel in each benchmarking package, this PR enables having a dedicated module to implement this properly and provide the best possible cross-platform support (wrapping platform-specific TSCs as needed for other platforms), which other benchmarking packages can reuse and count on.
So provide a Nimble package as the foundation for these other Nimble packages and use "atlas" to be able to use it in Nim's core. I wrote Atlas for a reason.
It must be usable inside the stdlib. atlas is far from production ready, see all the issues I ran into: https://github.com/nim-lang/Nim/labels/atlas; I plan to release some fixes but it's still far off. Furthermore, a pkg/cputicks package wouldn't help with VM code; see for example https://forum.nim-lang.org/t/5261#52440 and other threads which request the ability to benchmark VM code. With this PR, this functionality is also available in the VM.
Except that:
[0] https://documentation-service.arm.com/static/611fa684674a052ae36c7c91
good point, I've pushed a commit to clarify this
good research, I've rephrased things a bit, fixed a reference and added it as a comment, and added some notes regarding future work to support this platform, see the commit. Incidentally, this feedback loop justifies the point of adding such functionality to the stdlib; the review process can only improve things and lead to better design/APIs/choices, esp. for handling cross-platform details; a separate nimble package would be unlikely to get the same attention and improve over time into a comprehensive cross-platform optimized solution.
the main thing that stops me from placing this in std/monotimes is the dependency on std/times, which itself carries tons of dependencies; if that dependency were removed, std/cputicks would be fine to move into std/monotimes. This is not simple however, as it requires non-trivial tasks in order not to end up with a bad design:
A separate std/cputicks sidesteps all these issues and doesn't prevent an import/re-export of std/cputicks from std/monotimes anyway, which completely addresses the point regarding discoverability.
I mean, this is quite a ridiculous statement given the large amount of incredible libraries that exist out there - both Nim and other languages. The fact that unmaintained libraries exist does not mean that maintained libraries don't exist - that's not a logical leap you can take, sorry. Arguably, many libraries out there are written because the stdlib is unmaintained in certain areas.

You could argue that the attempt to add the code to the stdlib without doing proper research prompted someone to look at the problem, but that doesn't sound like a sustainable development model - disproving potentially dubious code is quite a drain. This is one of the reasons why it makes sense to develop things outside a common repository first, so that people interested in the problem can refine the solution - when/if it reaches a certain quality level, it can be considered for inclusion. Imagine how many fewer "let's break common stuff because I don't like how the previous author wrote it" discussions we would have then.
Only marginally touched upon by @Varriount & TC, but it is important here not to confuse the "units of measurement" or maximum resolution with what they measure; full understanding of this really discounts the utility of "clock cycles". There is both the duration of execution (via serialization) and the "error" (via 12-24 cycle pipeline depths) of flushing pipelines. This creates a situation where you still have quite a bit of measurement error - on the order of at least 3-6 ns on ~5GHz CPUs. You can (maybe) not serialize, too, but then there is a lot of error on the "what" you are measuring (two time points in the middle of a re-order buffer hundreds of instructions deep). Either way, 1 ns "reporting resolution" is really not bad compared to measurement error.

Also, I believe not yet mentioned: the "CPU cycle counter" on Intel only "ticks" at a fixed frequency, even when all the cores are down-clocked to 800 MHz or whatever. So the "cycle counter" has on Intel essentially become a "real time" query since the Nehalem era of the late noughties. See this stack overflow thread which mentions checking out the turbostat source code in the Linux kernel. Given that, you might as well use regular time units like "nanoseconds". Tenths of a nanosecond might induce less rounding error, but see above, 1 ns is not so bad as-is.

Ok - so real time nanoseconds is not bad in terms of either measurement error or semantics, but what about query performance? Don't know about MacOS/Windows, but on Linux clock_gettime goes through the vDSO and is already quite cheap.

Looping back to measurement error, some might say that it is unreliable/unrealistic to expect the "sub-latency inner time-points" of two distinct calls to "line up", and would assign as measurement error in the delta the entire latency of these calls, which is >>1ns.
In conclusion, as much as I used to be a fan of using cycle counts myself, the modern CPU world has evolved to the point where there is little to no meaningful advantage on Intel - either in "cycle semantics" or query performance. Unless you are measuring direct time deltas much greater than 100 cycles, you are probably just kidding yourself as to the meaningfulness of said time deltas. If you are not kidding yourself, 20 ns is still not so big compared to 3-10 ns errors (regardless of "units" being cycles or nanoseconds). I'm not exactly happy about this, but it is how it is. If you are riding the Intel rocket, one might paraphrase the immortal black comedy Dr. Strangelove: "Learn to Stop Worrying About Cycles and Love the Nanosecond". :-( :-)
proposal

New module `std/cputicks` providing cpu instruction-level granularity, with much higher precision and lower overhead than either `times.cpuTime` or `std/monotimes` (and fewer module import dependencies), in particular a strictly monotonic counter instead of a merely monotonic one. This is what should be used for profiling and micro benchmarks (as opposed to looping N times over some code, which can skew results in various ways, affect the cache, register allocation, etc).

example 1: benchmarking

`getCpuTicks` has 4.5X less overhead than `getMonoTime()` and 71X less overhead than `cpuTime()`, see https://gist.github.com/timotheecour/e5d85be09b141b4bf9f877e1c8731025 (`-d:case1`); in other OS's, the gap is even larger. `getMonoTime()` is not strictly monotonic and can't be used in a meaningful way to measure code under a certain number of instructions (see `-d:case2`).

example 2: profiling

I'm using this in nim-lang/Nim#15827 in not-yet-pushed commits which add full profiling capability with low overhead; it advantageously replaces nimprof and other platform-specific profilers, it understands nim code and gives meaningful results wrt recursive functions, exception handling, etc; more details on this later.

example 3: correctness for things like initRand, etc

Allows fixing things like nim-lang/Nim#17898 with a strictly monotonic counter and low overhead (nim-lang/Nim#18729 has more overhead and less resolution, in contrast); see nim-lang/Nim#18149 which uses std/cputicks to fix initRand in #17898.

availability

`rdtsc` is available on all x86 architectures (refs https://en.wikipedia.org/wiki/Time_Stamp_Counter) and there should be an equivalent instruction in other architectures; see cross-platform benchmark code here: https://github.com/google/benchmark/blob/v1.1.0/src/cycleclock.h#L116

links

PR: implemented in nim-lang/Nim#18743