-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trace GC and Scheduler #14599
Trace GC and Scheduler #14599
Conversation
Trace: - Fix measurements to use Time.monotonic - Fix printf escapes to represent unsigned integers Stats: - Calculate variance on-the-fly (rather than collecting each value) - Manual parse instead of regular expression (45% faster) - Avoid String allocations when possible (another 30% faster) - Add `--fast` option to skip the calculation min/max/mean/stddev (30% faster)
The actual trace logs are disabled by default and you must declare the `CRYSTAL_DEBUG` to enable each section (`gc`, `sched`). For example: ``` $ crystal build app.cr $ CRYSTAL_DEBUG=gc,sched ./app 2> >(gzip > trace.log.gz) $ zcat trace.log.gz | crystal tool trace - ```
So we match exact references to 'gc' and 'sched' and expect the delimiter to be a comma (','). This allows to remove the `strstr(3)` libc call that was previously introduced and the tokenization and comparison now relies on slices and pure crystal (no need to bring `strsep(3)` or `strtok(3)` libc calls).
Renames `print_error(fmt, *args, &)` as `printf(fmt, *args, &)` as it involves formatting a string, not printing the formatted string to the standard error. Also introduces `print_error(bytes)` to do the actual print to standard error (without formatting) so we can do both format then print to error in two steps. Since crystal printf is a subset of C printf and we don't support printing floats, the trace is changed to print durations as nanoseconds integer + fix the tick to be the monotonic time since the program started instead of the current monotonic time (should be more useful). Last but not least, fixes the 'crystal tool trace stats' command to parse an integer instead of float (removes dependency on `strtod(2)`) and to leverage StringPool instead of the fixed list of static strings, so adding or removing sections/actions won't affect the command.
src/crystal/system/print_error.cr
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a related changer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes:
-
I renamed
#print_error(fmt, *args, &)
as#printf(fmt, *args, &)
because it doesn't write to stderr, it's only formatting the string, and the caller is responsible from writing somewhere. -
I split
#print_error
as a couple methods: one to write to stderr and one to format and write, so we keep using the latter in the stdlib (no change) and I can use the former for writing the buffered string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is true regardless of this PR, so maybe we can extract it to its own PR?
raise RuntimeError.from_winerror("QueryPerformanceFrequency") | ||
end | ||
frequency | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change because I call Time.monotonic
before calling __crystal_main
that would setup @@performance_frequency
(or set it up later). By making it a method, the @@performance_frequency
is initialized before the program starts (or it's zeroed) in BSS or DATA, and it works.
It's still imperfect since it can raise (same on unix)... and that would probably not behave well. Maybe we'd like to have a specific raise-free method that would return the total nanoseconds (or zero in case of failure). That could lead to less computations, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a .ticks
method that doesn't raise (not even on overflow) so tracing won't be affected + it directly returns the monotonic time in nanoseconds, which is simpler.
Also the win32 functions are guaranteed to never fail since Windows XP, and we only support Windows 7+ so I removed the error checks.
Would this fit better in crystal-lang/perf-tools, or does it touch so many internals that a shard implementation is impossible? |
@HertzDevil I've been wondering about that. On the general side it's mostly having So, it's possible to externalize, and the I still personally lean to have it right into stdlib, ready to be used and extended (e.g. sync primitives: channels, wait groups, mutexes) with just a compilation flag. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still more questions, sorry!
src/crystal/system/print_error.cr
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is true regardless of this PR, so maybe we can extract it to its own PR?
It returns the current time from the monotonic clock in nanoseconds with additional guarantees that it won't raise or allocate memory. Note: the Win32 functions are guaranteed to never fail since Windows XP onwards, and on posix, clock_gettime may only fail on invalid input, so we can safely outlook errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minuscule suggestions
Co-authored-by: Sijawusz Pur Rahnama <[email protected]>
Closed in favor to #14659 |
Introduces a mechanism to trace the GC and scheduler operations as they happen, when they happen and what they do.
This has proven invaluable at understanding multi-threaded synchronization issues, that would have been impossible to reproduce using a debugger (you can't step manually and hope to reach a race condition: you need brute force and then analysis). In addition tracing the GC can lead to interesting stats about how much time is spent on GC (malloc, collect, ...), or find out how many allocations happened and when they happen. With both traces, we can cross analyze the lifetime of a program.
The tracing itself is relatively simple: each trace is a line with its section (gc, sched), the operation (malloc, enqueue, ...) then context metadata (thread, fiber, time + duration) and eventually metadata about the operation. The trace is meant to be easy to parse and read, to be grepped, searched, copy pasted, you name it.
Tracing is disabled by default, and must be enabled manually by compiling with
-Dtracing
. We then enable one or more sections to trace with theCRYSTAL_TRACE
environment variable, which is the list of section names separated by a comma (,
). For exampleCRYSTAL_TRACE=gc,sched
.Examples
A basic hello world on a single thread (GC only):Same, but multi-threaded and tracing the scheduler this time:
The crystal tool trace command has been added that will generate stats from a trace file. Using an MT trace of the hello world example above with both the GC and the scheduler enabled:
Note that the command isn't very performant. It's doing some effort to avoid allocating memory as much as possible (we can likely still optimize), but parsing 60 million events trace will still take dozens of seconds. You can use the
--fast
switch to skip some details to shave some time.Technical notes
The trace is buffered (on the stack) then written once to the standard error directly (without going through STDERR) and not written to a file. This is to avoid interleaved/smashed traces as much as possible when tracing from multiple threads (taking advantage).
Compiling with
-Dtracing
has a limited impact on the executable performance (by default nothing is traced). Enabling some traces then has more impact, but the executable is still somewhat performant (despite measuring & tracing millions of events) that race conditions are still easily reproducible for example.We also don't go though a channel like logs are, to avoid fiber context switches that would pollute the trace, but also because we can't: tracing will commence before the crystal application has even started.
Evolutions
We could augment the tracing to report more/less data about the program, for example only output GC collection stats (with more before/after details). We could also listen to a signal (and/or a keyboard shortcut) to stop the world and print details about the threads and schedulers.