instrument_post_syscall() should be called after post_system_call() processes the real result #1

Closed
derekbruening opened this issue Nov 27, 2014 · 3 comments

Comments

@derekbruening
Contributor

From [email protected] on February 11, 2009 13:53:44

It seems I threw in the syscall API too quickly: setting the
mcontext/result post-syscall should happen after DR handles the syscall,
and should be considered only a cosmetic result for fooling the app.

xref PR 207947 on syscall API feature

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=1
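
To illustrate the intended model, here is a minimal sketch in terms of the current DynamoRIO client API (not the original 2009 code; the choice of SYS_getpid and the replacement value are arbitrary): the post-syscall event fires after DR has processed the real result, and dr_syscall_set_result() only changes the value the application observes.
```
#include <sys/syscall.h>
#include "dr_api.h"

static bool
event_filter_syscall(void *drcontext, int sysnum)
{
    /* Only request pre/post events for the syscall we care about. */
    return sysnum == SYS_getpid;
}

static void
event_post_syscall(void *drcontext, int sysnum)
{
    /* Per the model above, DR has already processed the real result by
     * the time this event fires; reading it here sees the true value. */
    reg_t real_result = dr_syscall_get_result(drcontext);
    (void)real_result;
    /* Cosmetic only: change what the application sees as the result. */
    dr_syscall_set_result(drcontext, 42);
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_filter_syscall_event(event_filter_syscall);
    dr_register_post_syscall_event(event_post_syscall);
}
```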

@derekbruening
Contributor Author

From [email protected] on February 11, 2009 11:57:55

Owner: qin.zhao

@derekbruening
Contributor Author

From [email protected] on February 11, 2009 12:17:13

This could cause other problems.
For example, in the case of an application mmap, if I want to perform the mmap
twice, I need to set the system call number via dr_syscall_set_sysnum(drcontext,
SYS_mmap2) (i.e., eax = 192) and invoke dr_syscall_invoke_another(). But the DR
handler will later treat that eax value (192) as the mmap result (base address)
and cause a segmentation fault.
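
A sketch of that double-mmap scenario follows (the second mmap's arguments are illustrative placeholders, not taken from any real client): after dr_syscall_set_sysnum() sets eax to 192 and dr_syscall_invoke_another() is requested, DR's own post-syscall processing must not mistake that 192 for the first mmap's return value.
```
#include <sys/mman.h>
#include <sys/syscall.h>
#include "dr_api.h"

/* Inside the client's post-syscall event; SYS_mmap2 exists on 32-bit
 * Linux, matching the eax == 192 described above. */
static void
event_post_syscall(void *drcontext, int sysnum)
{
    if (sysnum != SYS_mmap2)
        return;
    /* The application's own mmap result (its base address). */
    reg_t app_base = dr_syscall_get_result(drcontext);
    (void)app_base;
    /* Queue a second, anonymous mmap.  This sets eax to 192 (SYS_mmap2),
     * the value the bug above then misreads as an mmap base address. */
    dr_syscall_set_sysnum(drcontext, SYS_mmap2);
    dr_syscall_set_param(drcontext, 0, 0);            /* addr: kernel picks */
    dr_syscall_set_param(drcontext, 1, 4096);         /* length: one page */
    dr_syscall_set_param(drcontext, 2, PROT_READ | PROT_WRITE);
    dr_syscall_set_param(drcontext, 3, MAP_PRIVATE | MAP_ANONYMOUS);
    dr_syscall_set_param(drcontext, 4, (reg_t)-1);    /* fd: none */
    dr_syscall_set_param(drcontext, 5, 0);            /* pgoffset */
    dr_syscall_invoke_another(drcontext);
}
```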

@derekbruening
Contributor Author

From [email protected] on February 22, 2009 20:40:10

Status: Done

derekbruening added a commit that referenced this issue May 20, 2023
If cmake 3.17+ is in use, enables retrying of failed tests up to 3x,
with any one passing attempt counting as an overall pass.  This
avoids flaky tests marking the whole suite red, which is even more
problematic with merge queues.

Tested:
I made a test which fails 3/4 of the time:
  --------------------------------------------------
  add_test(bogus bash -c "exit \$((RANDOM % 4))")
  --------------------------------------------------
I made it easy to run just this test (gave it a label; disabled other
builds, etc.) for convenience and then ran:
  --------------------------------------------------
  $ ctest -VV -S ../src/suite/runsuite.cmake,64_only
  --------------------------------------------------
Which resulted in:
  --------------------------------------------------
  test 1
      Start 1: bogus

  1: Test command: /usr/bin/bash "-c" "exit $((RANDOM % 4))"
  1: Working Directory: /usr/local/google/home/bruening/dr/git/build_suite/build_debug-internal-64
  1: Test timeout computed to be: 600
  1/1 Test #1: bogus ............................***Failed    0.00 sec
      Start 1: bogus

  1: Test command: /usr/bin/bash "-c" "exit $((RANDOM % 4))"
  1: Working Directory: /usr/local/google/home/bruening/dr/git/build_suite/build_debug-internal-64
  1: Test timeout computed to be: 600
      Test #1: bogus ............................   Passed    0.00 sec

  100% tests passed, 0 tests failed out of 1
  --------------------------------------------------

Issue: #2204, #5873
Fixes #2204
derekbruening added a commit that referenced this issue May 20, 2023
If cmake 3.17+ is in use, enables retrying of failed tests up to 3x, with
any one passing attempt counting as an overall pass. This avoids flaky
tests marking the whole suite red, which is even more problematic with
merge queues.

All of our GitHub Actions platforms (macos-11, windows-2019,
ubuntu-20.04, ubuntu-22.04) have cmake 3.26+, so this is enabled for all
of our GA CI tests.

Tested:
I made a test which fails 3/4 of the time:
```
  add_test(bogus bash -c "exit \$((RANDOM % 4))")
```
I made it easy to run just this test (gave it a label; disabled other
builds, etc.) for convenience and then ran:
```
  $ ctest -VV -S ../src/suite/runsuite.cmake,64_only
```
Which resulted in:
```
  test 1
      Start 1: bogus

  1: Test command: /usr/bin/bash "-c" "exit $((RANDOM % 4))"
  1: Working Directory: /usr/local/google/home/bruening/dr/git/build_suite/build_debug-internal-64
  1: Test timeout computed to be: 600
  1/1 Test #1: bogus ............................***Failed    0.00 sec
      Start 1: bogus

  1: Test command: /usr/bin/bash "-c" "exit $((RANDOM % 4))"
  1: Working Directory: /usr/local/google/home/bruening/dr/git/build_suite/build_debug-internal-64
  1: Test timeout computed to be: 600
      Test #1: bogus ............................   Passed    0.00 sec

  100% tests passed, 0 tests failed out of 1
```

Issue: #2204, #5873
Fixes #2204
derekbruening added a commit that referenced this issue Aug 8, 2023
Switches scheduler_launcher from using the tid to distinguish
inputs to using the input ordinal.  Tid values can be duplicated so they
should not be used as unique identifiers across workloads.

Tested: No automated test currently relies on the launcher; it is
there for experimentation and as an example for how to use the
scheduler, so we want it to use the recommended techniques.  I ran it
on the threadsig app and confirmed record and replay are using
ordinals:
  ===========================================================================
  $ rm -rf drmemtrace.*.dir; bin64/drrun -stderr_mask 12 -t drcachesim -offline -- ~/dr/test/threadsig 16 2000 && bin64/drrun -t drcachesim -simulator_type basic_counts -indir drmemtrace.*.dir > COUNTS 2>&1 && clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -record_file record.zip > RECORD 2>&1 && clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -replay_file record.zip > REPLAY 2>&1 && tail -n 4 RECORD REPLAY
  Estimation of pi is 3.141592674423126
  Received 89 alarms
  ==> RECORD <==
  Core #0: 16 15 16 15 16 0 15 16 15 8 16 6 5 7
  Core #1: 9 3 12 16 11 16 8 0 16 0 16 1 16
  Core #2: 3 14 16 14 16 0 15 16 8 16 2 6 8 1 10
  Core #3: 13 3 13 9 11 12 16 6 16 6 16 2 4

  ==> REPLAY <==
  Core #0: 16 15 16 15 16 0 15 16 15 8 16 6 5 7
  Core #1: 9 3 12 16 11 16 8 0 16 0 16 1 16
  Core #2: 3 14 16 14 16 0 15 16 8 16 2 6 8 1 10
  Core #3: 13 3 13 9 11 12 16 6 16 6 16 2 4
  ===========================================================================

Issue: #5843
derekbruening added a commit that referenced this issue Aug 9, 2023
Switches scheduler_launcher from using the tid to distinguish inputs
to using the input ordinal. Tid values can be duplicated so they should not be
used as unique identifiers across workloads.

Tested: No automated test currently relies on the launcher; it is there
for experimentation and as an example for how to use the scheduler, so
we want it to use the recommended techniques. I ran it on the threadsig
app and confirmed record and replay are using ordinals:
```
$ rm -rf drmemtrace.*.dir; bin64/drrun -stderr_mask 12 -t drcachesim -offline -- ~/dr/test/threadsig 16 2000 && bin64/drrun -t drcachesim -simulator_type basic_counts -indir drmemtrace.*.dir > COUNTS 2>&1 && clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -record_file record.zip > RECORD 2>&1 && clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -replay_file record.zip > REPLAY 2>&1 && tail -n 4 RECORD REPLAY
Estimation of pi is 3.141592674423126
Received 89 alarms
==> RECORD <==
Core #0: 16 15 16 15 16 0 15 16 15 8 16 6 5 7
Core #1: 9 3 12 16 11 16 8 0 16 0 16 1 16
Core #2: 3 14 16 14 16 0 15 16 8 16 2 6 8 1 10
Core #3: 13 3 13 9 11 12 16 6 16 6 16 2 4

==> REPLAY <==
Core #0: 16 15 16 15 16 0 15 16 15 8 16 6 5 7
Core #1: 9 3 12 16 11 16 8 0 16 0 16 1 16
Core #2: 3 14 16 14 16 0 15 16 8 16 2 6 8 1 10
Core #3: 13 3 13 9 11 12 16 6 16 6 16 2 4
```

Issue: #5843
derekbruening added a commit that referenced this issue Aug 11, 2023
Removes the original (but never implemented) report_time() heartbeat
design in favor of the simulator passing the current time to a new
version of next_record().

Implements QUANTUM_TIME by recording the start time of each input when
it is first scheduled and comparing to the new time in next_record().
Switches are only done at instruction boundaries for simplicity of
interactions with record-replay and skipping.
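
As a rough sketch of the simulator side of this change (assuming the drmemtrace scheduler headers and namespace; the helper below, the loop structure, and the status handling are illustrative, with only the time-taking next_record() and QUANTUM_TIME coming from this change):
```
#include <chrono>
#include <cstdint>

#include "scheduler.h" /* drmemtrace scheduler (assumed include path). */

using dynamorio::drmemtrace::memref_t;
using dynamorio::drmemtrace::scheduler_t;

/* Hypothetical helper: microsecond wall-clock time for QUANTUM_TIME. */
static uint64_t
current_time_us()
{
    return std::chrono::duration_cast<std::chrono::microseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

/* Per-core loop passing the current time into the new next_record()
 * overload so time-based quantum switches can trigger. */
static void
run_core(scheduler_t::stream_t *stream)
{
    memref_t record;
    for (;;) {
        scheduler_t::stream_status_t status =
            stream->next_record(record, current_time_us());
        if (status == scheduler_t::STATUS_EOF)
            break;
        if (status != scheduler_t::STATUS_OK)
            continue; /* Waiting or error handling: elided in this sketch. */
        /* ... feed the record to the simulator ... */
    }
}
```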

Adds 2 unit tests.

Adds time support with wall-clock time to the scheduler_launcher.
This was tested manually on some sample traces.  For threadsig traces,
with DEPENDENCY_TIMESTAMPS, the quantum size doesn't make a huge difference
as the timestamp ordering imposes significant constraints.  I added an
option to ignore the timestamps ("-no_honor_stamps") and there we
really see the effects of smaller quanta with more context switches.

===========================================================================
With timestamp deps and a 2ms quantum (compare to 2ms w/o deps below):
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -sched_time -verbose 1 -honor_stamps
Core #0: 15 12 1 15 1 15 7 12 15 7 6 9 5
Core #1: 13 10 11 15 12 10 15 12 10 15 10 11 10 15 10 8 2
Core #2: 16 11 15 10 11 15 11 15 11 15 4 7 12 4 0 14
Core #3: 3 1 15 12 10 15 12 10 1 12 15 7 15 4 15

===========================================================================
Without, but a long quantum of 20ms:
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 20000 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 5 8 12 16 4 11 15
Core #1: 1 4 9 14 0 7 9 0
Core #2: 2 6 10 13 1 6 13 1
Core #3: 3 7 11 15 2 3 8 10 14 16

===========================================================================
Without, but a smaller quantum of 2ms:
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 5 9 13 1 7 9 13 0 4 8 11 15 3 7 9 13 3 5 10 13 1 7 6 12 14 7 8 11 0 7 8 10 16 3 4 9 15 14 2 6 11 0 1 5 10 16 7 8 12 13 3 8 6 15 0 9 11 13
Core #1: 1 4 8 12 16 2 6 11 15 1 5 10 14 2 8 11 16 1 7 9 15 0 4 9 15 0 2 6 12 16 3 5 12 13 1 5 10 16 7 8 12 13 3 4 9 15 0 1 5 10 16 7 2 9 13 1 15
Core #2: 2 7 10 14 0 4 8 12 16 2 6 12 16 1 5 10 15 0 4 6 12 14 2 8 11 16 3 5 10 13 1 4 9 15 14 2 6 11 0 1 5 10 16 7 8 12 13 3 4 9 11 14 4 10 11 14 4 16 0
Core #3: 3 6 11 15 3 5 10 14 3 7 9 13 0 4 6 12 14 2 8 11 16 3 5 10 13 1 4 9 15 14 2 6 11 0 7 8 12 13 3 4 9 15 14 2 6 11 14 2 6 15 0 1 5 12 16 2 12 1

===========================================================================
Without, but a tiny quantum of 200us:
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 200 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 4 7 11 15 2 6 10 14 1 7 9 12 16 5 10 12 4 8 3 11 12 1 8 11 16 7 8 15 12 0 6 12 7 13 10 2 8 15 16 2 3 15 6 11 7 13 6 10 1 8 5 10 1 3 6 14 11 7 15 2 4 12 13 5 9 10 15 6 9 10 7 6 2 1 11 3 14 16 12 13 5 1 11 6 8 2 10 0 7 5 10 0 3 14 1 15 6 7 5 4 0 12 14 9 10 16 14 8 10 11 13 7 9 0 13 3 9 1 13 3 5 2 16 14 4 15 0 6 4 15 0 13 12 8 10 1 3 4 15 2 14 3 5 11 16 13 6 15 10 2 12 6 4 9 12 6 15 10 1 7 8 11 2 14 13 5 10 1 12 8 5 0 14 8 3 4 16 6 15 11 2 1 8 3 4 0 14 7 4 2 6 14 7 11 10 1 9 13 2 14 12 3 5 10 0 9 13 11 8 12 7 16 10 0 3 5 10 0 3 13 11 15 12 13 11 8 3 13 15 9 12 7 2 8 0 4 6 15 9 3 6 15 14 12 5 15 8 0 3 6 2 0 1 6 13 0 3 6 2 11 9 4 2 0 10 4 2 0 10 4 13 0 10 4 13 0 1 15 2 12 1 15 2 0 11 15 6 13 1 15 4 16 14 11 4 16 14 10 4 16 14 11 5 2 13 9 3 4 6 1 11 7 2 16 15 12 4 5 1 12 10 6 13 9 4 2 5 15 3 10 5 15 12 11 16 1 14 7 2 13 9 4 10 6 8 14 3 2 15 7 4 0 6 8 4 0 6 7 12 3 16 6 1 4 16 15 9 12 3 5 13 8 11 0 6 1 4 16 2 7 14 10 5 13 12 3 0 9 8 11 10 6 1 14 16 2 7 11 0 9 1 4 3 5 13 12 3 5 13 4 11 10 6 8 15 11 5 8 4 11 5 13 15 3 6 8 15 11 16 2 7 12 3 5 8 4 14 16 2 15 12 0 5 13 4 14 3 2 4 14 3 8 10 1 16 6 13 4 7 5 13 10 16 11 9 2 12 3 5 6 10 1 7 0 15 12 14 8 2 10 3 11 9 15 16 7 13 2 12 3 13 2 0 16 14 5 6 16 14 5 15 4 1 11 9 6 10 14 2 0 16 14 9 6 12 14 8 4 10 9 0 13 1 12 8 15
Core #1: 1 5 ... <omitted rest for space but all are as long as Core #0>
===========================================================================

Issue: #5843
derekbruening added a commit that referenced this issue Aug 15, 2023
Removes the original (but never implemented) report_time() heartbeat
design in favor of the simulator passing the current time to a new
version of next_record().

Implements QUANTUM_TIME by recording the start time of each input when
it is first scheduled and comparing to the new time in next_record().
Switches are only done at instruction boundaries for simplicity of
interactions with record-replay and skipping.

Adds 2 unit tests.

Adds time support with wall-clock time to the scheduler_launcher. This
was tested manually on some sample traces. For threadsig traces, with
DEPENDENCY_TIMESTAMPS, the quantum size doesn't make a huge difference as the
timestamp ordering imposes significant constraints. I added an option to
ignore the timestamps ("-no_honor_stamps") and there we really see the
effects of smaller quanta with more context switches.

With timestamp deps and a 2ms quantum (compare to 2ms w/o deps below):
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -sched_time -verbose 1 -honor_stamps
Core #0: 15 12 1 15 1 15 7 12 15 7 6 9 5
Core #1: 13 10 11 15 12 10 15 12 10 15 10 11 10 15 10 8 2
Core #2: 16 11 15 10 11 15 11 15 11 15 4 7 12 4 0 14
Core #3: 3 1 15 12 10 15 12 10 1 12 15 7 15 4 15
```
Without, but a long quantum of 20ms:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 20000 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 5 8 12 16 4 11 15
Core #1: 1 4 9 14 0 7 9 0
Core #2: 2 6 10 13 1 6 13 1
Core #3: 3 7 11 15 2 3 8 10 14 16
```
Without, but a smaller quantum of 2ms:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 5 9 13 1 7 9 13 0 4 8 11 15 3 7 9 13 3 5 10 13 1 7 6 12 14 7 8 11 0 7 8 10 16 3 4 9 15 14 2 6 11 0 1 5 10 16 7 8 12 13 3 8 6 15 0 9 11 13
Core #1: 1 4 8 12 16 2 6 11 15 1 5 10 14 2 8 11 16 1 7 9 15 0 4 9 15 0 2 6 12 16 3 5 12 13 1 5 10 16 7 8 12 13 3 4 9 15 0 1 5 10 16 7 2 9 13 1 15
Core #2: 2 7 10 14 0 4 8 12 16 2 6 12 16 1 5 10 15 0 4 6 12 14 2 8 11 16 3 5 10 13 1 4 9 15 14 2 6 11 0 1 5 10 16 7 8 12 13 3 4 9 11 14 4 10 11 14 4 16 0
Core #3: 3 6 11 15 3 5 10 14 3 7 9 13 0 4 6 12 14 2 8 11 16 3 5 10 13 1 4 9 15 14 2 6 11 0 7 8 12 13 3 4 9 15 14 2 6 11 14 2 6 15 0 1 5 12 16 2 12 1
```
Without, but a tiny quantum of 200us:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 200 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 4 7 11 15 2 6 10 14 1 7 9 12 16 5 10 12 4 8 3 11 12 1 8 11 16 7 8 15 12 0 6 12 7 13 10 2 8 15 16 2 3 15 6 11 7 13 6 10 1 8 5 10 1 3 6 14 11 7 15 2 4 12 13 5 9 10 15 6 9 10 7 6 2 1 11 3 14 16 12 13 5 1 11 6 8 2 10 0 7 5 10 0 3 14 1 15 6 7 5 4 0 12 14 9 10 16 14 8 10 11 13 7 9 0 13 3 9 1 13 3 5 2 16 14 4 15 0 6 4 15 0 13 12 8 10 1 3 4 15 2 14 3 5 11 16 13 6 15 10 2 12 6 4 9 12 6 15 10 1 7 8 11 2 14 13 5 10 1 12 8 5 0 14 8 3 4 16 6 15 11 2 1 8 3 4 0 14 7 4 2 6 14 7 11 10 1 9 13 2 14 12 3 5 10 0 9 13 11 8 12 7 16 10 0 3 5 10 0 3 13 11 15 12 13 11 8 3 13 15 9 12 7 2 8 0 4 6 15 9 3 6 15 14 12 5 15 8 0 3 6 2 0 1 6 13 0 3 6 2 11 9 4 2 0 10 4 2 0 10 4 13 0 10 4 13 0 1 15 2 12 1 15 2 0 11 15 6 13 1 15 4 16 14 11 4 16 14 10 4 16 14 11 5 2 13 9 3 4 6 1 11 7 2 16 15 12 4 5 1 12 10 6 13 9 4 2 5 15 3 10 5 15 12 11 16 1 14 7 2 13 9 4 10 6 8 14 3 2 15 7 4 0 6 8 4 0 6 7 12 3 16 6 1 4 16 15 9 12 3 5 13 8 11 0 6 1 4 16 2 7 14 10 5 13 12 3 0 9 8 11 10 6 1 14 16 2 7 11 0 9 1 4 3 5 13 12 3 5 13 4 11 10 6 8 15 11 5 8 4 11 5 13 15 3 6 8 15 11 16 2 7 12 3 5 8 4 14 16 2 15 12 0 5 13 4 14 3 2 4 14 3 8 10 1 16 6 13 4 7 5 13 10 16 11 9 2 12 3 5 6 10 1 7 0 15 12 14 8 2 10 3 11 9 15 16 7 13 2 12 3 13 2 0 16 14 5 6 16 14 5 15 4 1 11 9 6 10 14 2 0 16 14 9 6 12 14 8 4 10 9 0 13 1 12 8 15
Core #1: 1 5 ... <omitted rest for space but all are as long as Core #0>
```

Issue: #5843
derekbruening added a commit that referenced this issue Aug 18, 2023
Adds printing '.' for every record and '-' for waiting to the
scheduler unit tests and updates all the expected output.  This makes
it much easier to understand some of the results as now the lockstep
timing all lines up.

Adds -print_every to the launcher and switches to printing letters for
a better output of what happened on each core (if #inputs<=26).
Example:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 60000 -print_every 5000
Core #0: GGGGGGGGG,HH,F,B,G,I,A,CC,G,BB,A,FF,AA,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
Core #1: D,C,D,B,H,FF,EE,CC,II,AA,C,D,G,HH,D,G,II,G,I,G,I,G,HH,BB,II,BB,C,H,I,AA,C,F,I,H,II,AA,C,H,A,H,F,CC,DD,C,BB,HH,CC,F,BB,C,D,H,BB,D,B,EE,I,E,DD,B,F,H,A,D,C,D,E,B,D,I,D,AA,E,DD,EE,CC,II,C,D,I,AA,DD,B,E,I,D,C,E,FF,E,BB,EE,FF,E,AA,D,E,DD,H,BB,HH,D,H,BB,I,AA,II,H,A,FF,H,I,HH,DD,I,H,F,DD,I,A,HH,AA,CC,BB,CC,BB,D,B,FF,H,F,D,I,DD,FF,C,A,C,AA,F,AA,EE,A,D,E,FF,AA,F,A,E,A,E,DD,EE,F,E,F,A
Core #2: F,E,F,C,F,H,I,B,HH,II,FF,CC,G,H,DD,E,A,G,H,G,DD,G,F,D,A,H,I,FF,H,C,A,CC,II,A,FF,C,I,F,CC,B,FF,C,B,H,CC,B,D,B,DD,B,F,I,F,II,D,A,DD,I,D,H,E,H,I,D,HH,FF,BB,II,AA,EE,B,A,BB,E,II,A,BB,A,HH,E,AA,E,F,A,DD,HH,F,H,A,E,I,FF,I,B,F,II,A,FF,D,H,DD,I,AA,F,D,FF,AA,D,A,HH,A,H,F,A,FF,C,F,B,F,C,F,AA,B,FF,D,F,DD,B,C,H,CC,B,C,E,D,EE,C,E,D,EE,F,DD,E,F,D,A,DD,E,D,EE,D,E,D,AA,D,A,DD,F,D,C,D
Core #3: E,A,F,A,D,I,DD,BB,AA,BB,DD,G,EE,AA,H,G,D,B,G,B,G,II,F,HH,B,AA,I,B,A,HH,CC,HH,F,A,FF,C,HH,BB,F,D,F,C,FF,H,C,FF,DD,AA,I,B,II,AA,I,A,B,A,F,A,C,I,B,H,A,F,C,A,C,EE,F,D,EE,CC,E,BB,E,DD,E,CC,B,EE,C,EE,B,I,E,D,E,II,H,B,EE,I,EE,B,II,F,EE,A,D,AA,DD,HH,F,A,F,HH,D,A,II,H,F,II,FF,CC,B,AA,F,A,C,FF,D,C,D,CC,B,C,DD,H,I,F,CC,A,F,C,FF,E,A,DD,E,D,A,FF,AA,EE,F,DD,FF,E,F,EE,FF,AA,EEEEEEEEEEEEEEEE
```

Issue: #5843
derekbruening added a commit that referenced this issue Aug 21, 2023
Adds printing '.' for every record and '-' for waiting to the scheduler
unit tests and updates all the expected output. This makes it much
easier to understand some of the results as now the lockstep timing all
lines up.

Adds -print_every to the launcher and switches to printing letters for a
better output of what happened on each core (if #inputs<=26). Example:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 60000 -print_every 5000
Core #0: GGGGGGGGG,HH,F,B,G,I,A,CC,G,BB,A,FF,AA,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
Core #1: D,C,D,B,H,FF,EE,CC,II,AA,C,D,G,HH,D,G,II,G,I,G,I,G,HH,BB,II,BB,C,H,I,AA,C,F,I,H,II,AA,C,H,A,H,F,CC,DD,C,BB,HH,CC,F,BB,C,D,H,BB,D,B,EE,I,E,DD,B,F,H,A,D,C,D,E,B,D,I,D,AA,E,DD,EE,CC,II,C,D,I,AA,DD,B,E,I,D,C,E,FF,E,BB,EE,FF,E,AA,D,E,DD,H,BB,HH,D,H,BB,I,AA,II,H,A,FF,H,I,HH,DD,I,H,F,DD,I,A,HH,AA,CC,BB,CC,BB,D,B,FF,H,F,D,I,DD,FF,C,A,C,AA,F,AA,EE,A,D,E,FF,AA,F,A,E,A,E,DD,EE,F,E,F,A
Core #2: F,E,F,C,F,H,I,B,HH,II,FF,CC,G,H,DD,E,A,G,H,G,DD,G,F,D,A,H,I,FF,H,C,A,CC,II,A,FF,C,I,F,CC,B,FF,C,B,H,CC,B,D,B,DD,B,F,I,F,II,D,A,DD,I,D,H,E,H,I,D,HH,FF,BB,II,AA,EE,B,A,BB,E,II,A,BB,A,HH,E,AA,E,F,A,DD,HH,F,H,A,E,I,FF,I,B,F,II,A,FF,D,H,DD,I,AA,F,D,FF,AA,D,A,HH,A,H,F,A,FF,C,F,B,F,C,F,AA,B,FF,D,F,DD,B,C,H,CC,B,C,E,D,EE,C,E,D,EE,F,DD,E,F,D,A,DD,E,D,EE,D,E,D,AA,D,A,DD,F,D,C,D
Core #3: E,A,F,A,D,I,DD,BB,AA,BB,DD,G,EE,AA,H,G,D,B,G,B,G,II,F,HH,B,AA,I,B,A,HH,CC,HH,F,A,FF,C,HH,BB,F,D,F,C,FF,H,C,FF,DD,AA,I,B,II,AA,I,A,B,A,F,A,C,I,B,H,A,F,C,A,C,EE,F,D,EE,CC,E,BB,E,DD,E,CC,B,EE,C,EE,B,I,E,D,E,II,H,B,EE,I,EE,B,II,F,EE,A,D,AA,DD,HH,F,A,F,HH,D,A,II,H,F,II,FF,CC,B,AA,F,A,C,FF,D,C,D,CC,B,C,DD,H,I,F,CC,A,F,C,FF,E,A,DD,E,D,A,FF,AA,EE,F,DD,FF,E,F,EE,FF,AA,EEEEEEEEEEEEEEEE
```

Issue: #5843
derekbruening added a commit that referenced this issue Feb 7, 2024
Adds a new scheduler option randomize_next_input and corresponding
launcher option -sched_randomize.  When enabled, priorities,
timestamps, and FIFO ordering are ignored and instead a random next
element from the ready queue is selected.  This will be useful for
schedule sensitivity studies.
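
For reference, a sketch of enabling this programmatically rather than via the launcher flag (only randomize_next_input comes from this change; the include path, namespace, and constructor arguments follow the existing scheduler_options_t usage and are assumptions here):
```
#include "scheduler.h" /* drmemtrace scheduler (assumed include path). */

using dynamorio::drmemtrace::scheduler_t;

/* Hypothetical helper building options that use random input selection. */
static scheduler_t::scheduler_options_t
make_randomized_options()
{
    scheduler_t::scheduler_options_t ops(scheduler_t::MAP_TO_ANY_OUTPUT,
                                         scheduler_t::DEPENDENCY_IGNORE,
                                         scheduler_t::SCHEDULER_DEFAULTS);
    /* Ignore priorities, timestamps, and FIFO order: pick a random ready
     * input at each switch (useful for schedule sensitivity studies). */
    ops.randomize_next_input = true;
    return ops;
}
```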

Adds a unit test.

Tested manually end-to-end as well:
```
  $ for ((i=0; i<10; ++i)); do bin64/drrun -t drcachesim -simulator_type schedule_stats -core_sharded -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -cores 2 -sched_randomize 2>&1 | grep schedule; done
  Core #0 schedule: HEFBF_
  Core #1 schedule: GCDFA
  Core #0 schedule: GDBH
  Core #1 schedule: CFAFEF__
  Core #0 schedule: GDBF__
  Core #1 schedule: EFCFAH
  Core #0 schedule: BHFDF__
  Core #1 schedule: EGFAC
  Core #0 schedule: HAFEF__
  Core #1 schedule: GDCFB
  Core #0 schedule: ABFGF__
  Core #1 schedule: CEDFH
  Core #0 schedule: HDBF_F_F__
  Core #1 schedule: ECGA
  Core #0 schedule: HDEFA
  Core #1 schedule: CBFGF__
  Core #0 schedule: FHGFCF_
  Core #1 schedule: DEAB
  Core #0 schedule: EFABF_
  Core #1 schedule: GCDFH
```

Fixes #6636
derekbruening added a commit that referenced this issue Feb 8, 2024
Adds a new scheduler option randomize_next_input and corresponding
launcher option -sched_randomize. When enabled, priorities,
timestamps, and FIFO ordering are ignored and instead a random next
element from the ready queue is selected. This will be useful for
schedule sensitivity studies.

Adds a unit test.

Tested manually end-to-end as well:
```
  $ for ((i=0; i<10; ++i)); do bin64/drrun -t drcachesim -simulator_type schedule_stats -core_sharded -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -cores 2 -sched_randomize 2>&1 | grep schedule; done
  Core #0 schedule: HEFBF_
  Core #1 schedule: GCDFA
  Core #0 schedule: GDBH
  Core #1 schedule: CFAFEF__
  Core #0 schedule: GDBF__
  Core #1 schedule: EFCFAH
  Core #0 schedule: BHFDF__
  Core #1 schedule: EGFAC
  Core #0 schedule: HAFEF__
  Core #1 schedule: GDCFB
  Core #0 schedule: ABFGF__
  Core #1 schedule: CEDFH
  Core #0 schedule: HDBF_F_F__
  Core #1 schedule: ECGA
  Core #0 schedule: HDEFA
  Core #1 schedule: CBFGF__
  Core #0 schedule: FHGFCF_
  Core #1 schedule: DEAB
  Core #0 schedule: EFABF_
  Core #1 schedule: GCDFH
```

Fixes #6636
derekbruening added a commit that referenced this issue Sep 13, 2024
Removes the global runqueue and global sched_lock_, replacing with
per-output runqueues which each have a lock inside a new struct
input_queue_t which clearly delineates what the lock protects.  The
unscheduled queue remains global and has its own lock as another
input_queue_t.  The output fields .active and .cur_time are now
atomics, as they are accessed from other outputs yet are separate from
the queue and its mutex.

Makes the runqueue lock usage narrow, avoiding holding locks across
the larger functions.  Establishes a lock ordering convention: input >
output > unsched.

The removal of the global sched_lock_ avoids the lock contention seen
on fast analyzers (the original design targeted heavyweight
simulators).  On a large internal trace with hundreds of threads on
>100 cores we were seeing 41% of lock attempts collide with
the global queue:
```
    [scheduler] Schedule lock acquired     :  72674364
    [scheduler] Schedule lock contended    :  30144911
```
With separate runqueues we see < 1 in 10,000 collide:
```
    [scheduler] Stats for output #0
    <...>
    [scheduler]   Runqueue lock acquired             :  34594996
    [scheduler]   Runqueue lock contended            :        29
    [scheduler] Stats for output #1
    <...>
    [scheduler]   Runqueue lock acquired             :  51130763
    [scheduler]   Runqueue lock contended            :        41
    <...>
    [scheduler]   Runqueue lock acquired             :  46305755
    [scheduler]   Runqueue lock contended            :        44
    [scheduler] Unscheduled queue lock acquired      :     27834
    [scheduler] Unscheduled queue lock contended     :       273
    $ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
    11528
    $ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
    6814820713
    (gdb) p 11528/6814820713.*100
    $1 = 0.00016916072315753086
```

Before an output goes idle, it attempts to steal work from another
output's runqueue.  A new input option is added controlling the
migration threshold to avoid moving jobs too frequently.  The stealing
is done inside eof_or_idle() which now returns a new internal status
code STATUS_STOLE so the various callers can be sure to read the next
record.

Adds a periodic rebalancing with a period equal to another new input
option.  Adds flexible_queue_t::back() for rebalancing to not take from
the front of the queues.

Updates an output going inactive and promoting everything-unscheduled
to use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs
during stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different
resulting schedules.

Adds a new unit test targeting queue rebalancing.

Issue: #6938
derekbruening added a commit that referenced this issue Sep 17, 2024
Removes the global runqueue and global sched_lock_, replacing with
per-output runqueues which each have a lock inside a new struct
input_queue_t which clearly delineates what the lock protects. The
unscheduled queue remains global and has its own lock as another
input_queue_t. The output fields .active and .cur_time are now atomics,
as they are accessed from other outputs yet are separate from the queue
and its mutex.

Makes the runqueue lock usage narrow, avoiding holding locks across
the larger functions.  Establishes a lock ordering convention: input >
output > unsched.

The removal of the global sched_lock_ avoids the lock contention seen on
fast analyzers (the original design targeted heavyweight simulators). On
a large internal trace with hundreds of threads on >100
cores we were seeing 41% of lock attempts collide with
the global queue:
```
    [scheduler] Schedule lock acquired     :  72674364
    [scheduler] Schedule lock contended    :  30144911
```
With separate runqueues we see < 1 in 10,000 collide:
```
    [scheduler] Stats for output #0
    <...>
    [scheduler]   Runqueue lock acquired             :  34594996
    [scheduler]   Runqueue lock contended            :        29
    [scheduler] Stats for output #1
    <...>
    [scheduler]   Runqueue lock acquired             :  51130763
    [scheduler]   Runqueue lock contended            :        41
    <...>
    [scheduler]   Runqueue lock acquired             :  46305755
    [scheduler]   Runqueue lock contended            :        44
    [scheduler] Unscheduled queue lock acquired      :     27834
    [scheduler] Unscheduled queue lock contended     :       273
    $ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
    11528
    $ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
    6814820713
    (gdb) p 11528/6814820713.*100
    $1 = 0.00016916072315753086
```

Before an output goes idle, it attempts to steal work from another
output's runqueue. A new input option is added controlling the migration
threshold to avoid moving jobs too frequently. The stealing is done
inside eof_or_idle() which now returns a new internal status code
STATUS_STOLE so the various callers can be sure to read the next record.

Adds a periodic rebalancing with a period equal to another new input
option. Adds flexible_queue_t::back() for rebalancing to not take from
the front of the queues.

Updates an output going inactive and promoting everything-unscheduled to
use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs during
stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different
resulting schedules.

Adds a new unit test targeting queue rebalancing.

Tested under ThreadSanitizer for race detection on a relatively large
trace on 90 cores.

Issue: #6938
derekbruening added a commit that referenced this issue Oct 2, 2024
Adds a new scheduler feature and CLI option exit_if_fraction_left.
This applies to -core_sharded and -core_serial modes.  When an input
reaches EOF, if the number of non-EOF inputs left as a fraction of the
original inputs is equal to or less than this value then the scheduler
exits (sets all outputs to EOF) rather than finishing off the final
inputs.  This helps avoid long sequences of idles during staggered
endings with fewer inputs left than cores and only a small fraction of
the total instructions left in those inputs.

The default value in scheduler_options_t is 0 as simulators are
typically already choosing to stop at some even point.  For analyzers,
however, via the command-line option, the default is 0.05 (i.e., 5%),
which when tested on a large internal trace helps eliminate much of
the final idle time from the cores (just about any value over 0.05
works well: it is not overly sensitive).

Compare the numbers below for today's default with a long idle time
and so distinct differences between the "cpu busy by time" and "cpu
busy by time, ignoring idle past last instr" stats on a 39-core
schedule-stats run of a moderately large trace, with key stats and the
1st 2 cores (for brevity) shown here:

  1567052521 instructions
   878027975 idles
       64.09% cpu busy by record count
       82.38% cpu busy by time
       96.81% cpu busy by time, ignoring idle past last instr
Core #0 schedule: CccccccOXHhUuuuuAaSEOGOWEWQqqqFffIiTETENWwwOWEeeeeeeACMmTQFfOWLWVvvvvFQqqqqYOWOooOWOYOYQOWO_O_W_O_W_O_W_O_WO_WO_O_O_O_O_O_OR_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_RY_YyyyySUuuOSISO_S_S_SOPpSOKO_KO_KCcDKWDB_B_____________________________________________
Core #1 schedule: KkLWSFUQPDddddddddXxSUSVRJWKkRNJBWUWwwTttGgRNKkkRWNTtFRWKkRNWUuuGULRFSRSYKkkkRYAYFffGSRYHRYHNWMDddddddddRYGgggggYHNWK_YAHYNnGYSNHWwwwwSWSNKSYyyWKNNWKNNGAKWGggNnNW_NNWE_E_EF__________________________________________________

And now with -exit_if_fraction_left 0.05, where we lose (1567052521 -
1564522227)/1567052521. = 0.16% of the instructions but drastically
reduce the tail from 14% of the time to less than 1% of the time:

  1564522227 instructions
   120512812 idles
       92.85% cpu busy by record count
       96.39% cpu busy by time
       97.46% cpu busy by time, ignoring idle past last instr
766.85user 6.33system 1:15.88elapsed 1018%CPU (0avgtext+0avgdata 4947364maxresident)k
Core #0 schedule: CccccccOXHKYEGGETRARrrPRTVvvvRrrNWwwOOKWVRRrPBbbXUVvvvvvOWKVLWVvvJjSOWKVUuTIiiiFPpppKAaaMFfffAHOKWAaGNBOWKAPPOABCWKPWOKWPCXxxxZOWKCccJSOSWKJUYRCOWKCcSOSUKkkkOROK_O_O_O_O_O
Core #1 schedule: KkLWSMmmFLSFffffffJjWBbGBUuuuuuuuuuuBDBJJRJWKkRNJWMBKkkRNWKkRNWKkkkRNWXxxxxxZOooAaUIiTHhhhSDNnnnHZzQNnnRNWXxxxxxRNWUuuRNWKXUuXRNKRWKNXxxRWKONNHRKWONURKWXRKXRKNW_KR_KkRK_KRKR_R_R_R_R_R_R_R_R_R_R_R__R__R__R___R___R___R___R___R

Fixes #6959
derekbruening added a commit that referenced this issue Oct 4, 2024
Adds a new scheduler feature and CLI option exit_if_fraction_inputs_left. This
applies to -core_sharded and -core_serial modes. When an input reaches
EOF, if the number of non-EOF inputs left as a fraction of the original
inputs is equal to or less than this value then the scheduler exits
(sets all outputs to EOF) rather than finishing off the final inputs.
This helps avoid long sequences of idles during staggered endings with
fewer inputs left than cores and only a small fraction of the total
instructions left in those inputs.

The default value in scheduler_options_t and the CLI option is 0.05 (i.e., 5%),
which when tested on a large internal trace helps eliminate much of the
final idle time from the cores without losing many instructions.

Compare the numbers below for today's default with a long idle time and
so distinct differences between the "cpu busy by time" and "cpu busy by
time, ignoring idle past last instr" stats on a 39-core schedule-stats
run of a moderately large trace, with key stats and the 1st 2 cores (for
brevity) shown here:

```
  1567052521 instructions
   878027975 idles
       64.09% cpu busy by record count
       82.38% cpu busy by time
       96.81% cpu busy by time, ignoring idle past last instr
Core #0 schedule: CccccccOXHhUuuuuAaSEOGOWEWQqqqFffIiTETENWwwOWEeeeeeeACMmTQFfOWLWVvvvvFQqqqqYOWOooOWOYOYQOWO_O_W_O_W_O_W_O_WO_WO_O_O_O_O_O_OR_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_RY_YyyyySUuuOSISO_S_S_SOPpSOKO_KO_KCcDKWDB_B_____________________________________________ 
Core #1 schedule: KkLWSFUQPDddddddddXxSUSVRJWKkRNJBWUWwwTttGgRNKkkRWNTtFRWKkRNWUuuGULRFSRSYKkkkRYAYFffGSRYHRYHNWMDddddddddRYGgggggYHNWK_YAHYNnGYSNHWwwwwSWSNKSYyyWKNNWKNNGAKWGggNnNW_NNWE_E_EF__________________________________________________
```

And now with -exit_if_fraction_inputs_left 0.05, where we lose (1567052521 -
1564522227)/1567052521. = 0.16% of the instructions but drastically
reduce the tail from 14% of the time to less than 1% of the time:

```
  1564522227 instructions
   120512812 idles
       92.85% cpu busy by record count
       96.39% cpu busy by time
       97.46% cpu busy by time, ignoring idle past last instr
Core #0 schedule: CccccccOXHKYEGGETRARrrPRTVvvvRrrNWwwOOKWVRRrPBbbXUVvvvvvOWKVLWVvvJjSOWKVUuTIiiiFPpppKAaaMFfffAHOKWAaGNBOWKAPPOABCWKPWOKWPCXxxxZOWKCccJSOSWKJUYRCOWKCcSOSUKkkkOROK_O_O_O_O_O 
Core #1 schedule: KkLWSMmmFLSFffffffJjWBbGBUuuuuuuuuuuBDBJJRJWKkRNJWMBKkkRNWKkRNWKkkkRNWXxxxxxZOooAaUIiTHhhhSDNnnnHZzQNnnRNWXxxxxxRNWUuuRNWKXUuXRNKRWKNXxxRWKONNHRKWONURKWXRKXRKNW_KR_KkRK_KRKR_R_R_R_R_R_R_R_R_R_R_R__R__R__R___R___R___R___R___R
```

Fixes #6959
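
A small sketch of setting the new knob programmatically (the field name and its 0.05 default come from this change; the helper itself, and the assumed include path and namespace, are hypothetical):
```
#include "scheduler.h" /* drmemtrace scheduler (assumed include path). */

using dynamorio::drmemtrace::scheduler_t;

/* Hypothetical helper mirroring the CLI default of 0.05 (5%). */
static void
enable_early_exit(scheduler_t::scheduler_options_t &ops, double fraction = 0.05)
{
    /* Once the non-EOF inputs drop to this fraction (or less) of the
     * original inputs, all outputs report EOF instead of finishing off
     * the last few stragglers. */
    ops.exit_if_fraction_inputs_left = fraction;
}
```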
derekbruening added a commit that referenced this issue Oct 15, 2024
Adds a new interface trace_analysis_tool::preferred_shard_type() to
the drmemtrace framework to allow tools to request core-sharded
operation.

The cache simulator, TLB simulator, and schedule_stats tools override
the new interface to request core-sharded mode.
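
A sketch of what such an override might look like in a tool (the base class, enumerator, and include names here, analysis_tool_t and SHARD_BY_CORE, are assumptions based on the existing drmemtrace analysis framework; only preferred_shard_type() is from this change):
```
#include "analysis_tool.h" /* drmemtrace analysis framework (assumed path). */

using namespace dynamorio::drmemtrace;

/* Hypothetical tool that opts in to core-sharded operation; everything
 * except preferred_shard_type() is elided. */
class my_core_sharded_tool_t : public analysis_tool_t {
public:
    shard_type_t
    preferred_shard_type() override
    {
        /* Ask the framework for core-sharded (or core-serial) input. */
        return SHARD_BY_CORE;
    }
    /* ... process_memref(), print_results(), etc. ... */
};
```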

Unfortunately, it is not easy to detect core-sharded-on-disk traces in
the launcher, so the user must now pass `-no_core_sharded` when using
such traces with core-sharded-preferring tools to avoid the trace
being re-scheduled yet again.  Documentation for this is added and the
re-scheduling is turned into a fatal error since it is almost certainly
user error.

In the launcher, if all tools prefer core-sharded, and the user did
not specify -no_core_sharded, core-sharded (or core-serial) mode is
enabled, with a -verbose 1+ message.
```
  $ bin64/drrun -stderr_mask 0 -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1 -tool schedule_stats:cache_simulator
  Enabling -core_serial as all tools prefer it
  <...>
  Schedule stats tool results:
  Total counts:
             4 cores
             8 threads: 1257600, 1257602, 1257599, 1257603, 1257598, 1257604, 1257596, 1257601
        638938 instructions
  <...>
  Core #0 schedule: AEA_A_
  Core #1 schedule: BH_
  Core #2 schedule: CG
  Core #3 schedule: DF_
  <...>
  Cache simulation results:
  Core #0 (traced CPU(s): #0)
    L1I0 (size=32768, assoc=8, block=64, LRU) stats:
      Hits:                          123,659
  <...>
```

If at least one tool prefers core-sharded but others do not, a
-verbose 1+ message suggests running with an explicit -core_sharded.
```
  $ bin64/drrun -stderr_mask 0 -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1 -tool cache_simulator:basic_counts
  Some tool(s) prefer core-sharded: consider re-running with -core_sharded or -core_serial enabled for best results.
```

Reduces the scheduler queue diagnostics by 5x as they seem too
frequent in short runs.

Updates the documentation to mention the new defaults.

Updates numerous drcachesim test output templates.

Keeps a couple of tests using thread-sharded by passing -no_core_serial.

Fixes #6949
egrimley-arm added a commit that referenced this issue Nov 26, 2024
For compatibility with newer versions of PHP.

The error (in generate_decoder.stderr) was
```
PHP Fatal error:  Uncaught TypeError: ksort(): Argument #1 ($array)
must be of type array, null given in .../sve_decode_json.php:900
```