instrument_post_syscall() should be called after post_system_call() processes the real result #1

Closed
derekbruening opened this issue Nov 27, 2014 · 3 comments

Comments

@derekbruening
Contributor

From [email protected] on February 11, 2009 13:53:44

It seems I threw in the syscall API too quickly: setting the
mcontext/result post-syscall should happen after DR handles the syscall,
and should be considered only a cosmetic result for fooling the app.

xref PR 207947 on syscall API feature

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=1
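
To illustrate the intended model, here is a minimal sketch in terms of the current DynamoRIO client API (not the original 2009 code; the choice of SYS_getpid and the replacement value are arbitrary): the post-syscall event fires after DR has processed the real result, and dr_syscall_set_result() only changes the value the application observes.
```
#include <sys/syscall.h>
#include "dr_api.h"

static bool
event_filter_syscall(void *drcontext, int sysnum)
{
    /* Only request pre/post events for the syscall we care about. */
    return sysnum == SYS_getpid;
}

static void
event_post_syscall(void *drcontext, int sysnum)
{
    /* Per the model above, DR has already processed the real result by
     * the time this event fires; reading it here sees the true value. */
    reg_t real_result = dr_syscall_get_result(drcontext);
    (void)real_result;
    /* Cosmetic only: change what the application sees as the result. */
    dr_syscall_set_result(drcontext, 42);
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_filter_syscall_event(event_filter_syscall);
    dr_register_post_syscall_event(event_post_syscall);
}
```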

@derekbruening
Contributor Author

From [email protected] on February 11, 2009 11:57:55

Owner: qin.zhao

@derekbruening
Contributor Author

From [email protected] on February 11, 2009 12:17:13

This could cause other problems.
For example, in the case of an application mmap, if I want to perform the mmap
twice, I need to set the system call number via dr_syscall_set_sysnum(drcontext,
SYS_mmap2) (i.e., eax = 192) and invoke dr_syscall_invoke_another(). But the DR
handler will later treat that eax value (192) as the mmap result (base address)
and cause a segmentation fault.
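
A sketch of that double-mmap scenario follows (the second mmap's arguments are illustrative placeholders, not taken from any real client): after dr_syscall_set_sysnum() sets eax to 192 and dr_syscall_invoke_another() is requested, DR's own post-syscall processing must not mistake that 192 for the first mmap's return value.
```
#include <sys/mman.h>
#include <sys/syscall.h>
#include "dr_api.h"

/* Inside the client's post-syscall event; SYS_mmap2 exists on 32-bit
 * Linux, matching the eax == 192 described above. */
static void
event_post_syscall(void *drcontext, int sysnum)
{
    if (sysnum != SYS_mmap2)
        return;
    /* The application's own mmap result (its base address). */
    reg_t app_base = dr_syscall_get_result(drcontext);
    (void)app_base;
    /* Queue a second, anonymous mmap.  This sets eax to 192 (SYS_mmap2),
     * the value the bug above then misreads as an mmap base address. */
    dr_syscall_set_sysnum(drcontext, SYS_mmap2);
    dr_syscall_set_param(drcontext, 0, 0);            /* addr: kernel picks */
    dr_syscall_set_param(drcontext, 1, 4096);         /* length: one page */
    dr_syscall_set_param(drcontext, 2, PROT_READ | PROT_WRITE);
    dr_syscall_set_param(drcontext, 3, MAP_PRIVATE | MAP_ANONYMOUS);
    dr_syscall_set_param(drcontext, 4, (reg_t)-1);    /* fd: none */
    dr_syscall_set_param(drcontext, 5, 0);            /* pgoffset */
    dr_syscall_invoke_another(drcontext);
}
```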

@derekbruening
Contributor Author

From [email protected] on February 22, 2009 20:40:10

Status: Done

derekbruening added a commit that referenced this issue May 20, 2023
If cmake 3.17+ is in use, enables retrying of failed tests up to 3x,
with any one passing attempt counting as an overall pass.  This
avoids flaky tests marking the whole suite red, which is even more
problematic with merge queues.

Tested:
I made a test which fails 3/4 of the time:
  --------------------------------------------------
  add_test(bogus bash -c "exit \$((RANDOM % 4))")
  --------------------------------------------------
I made it easy to run just this test (gave it a label; disabled other
builds, etc.) for convenience and then ran:
  --------------------------------------------------
  $ ctest -VV -S ../src/suite/runsuite.cmake,64_only
  --------------------------------------------------
Which resulted in:
  --------------------------------------------------
  test 1
      Start 1: bogus

  1: Test command: /usr/bin/bash "-c" "exit $((RANDOM % 4))"
  1: Working Directory: /usr/local/google/home/bruening/dr/git/build_suite/build_debug-internal-64
  1: Test timeout computed to be: 600
  1/1 Test #1: bogus ............................***Failed    0.00 sec
      Start 1: bogus

  1: Test command: /usr/bin/bash "-c" "exit $((RANDOM % 4))"
  1: Working Directory: /usr/local/google/home/bruening/dr/git/build_suite/build_debug-internal-64
  1: Test timeout computed to be: 600
      Test #1: bogus ............................   Passed    0.00 sec

  100% tests passed, 0 tests failed out of 1
  --------------------------------------------------

Issue: #2204, #5873
Fixes #2204
derekbruening added a commit that referenced this issue May 20, 2023
If cmake 3.17+ is in use, enables retrying of failed tests up to 3x, with
any one passing attempt counting as an overall pass. This avoids flaky
tests marking the whole suite red, which is even more problematic with
merge queues.

All of our GitHub Actions platforms (macos-11, windows-2019,
ubuntu-20.04, ubuntu-22.04) have cmake 3.26+, so this is enabled for all
of our GA CI tests.

Tested:
I made a test which fails 3/4 of the time:
```
  add_test(bogus bash -c "exit \$((RANDOM % 4))")
```
I made it easy to run just this test (gave it a label; disabled other
builds, etc.) for convenience and then ran:
```
  $ ctest -VV -S ../src/suite/runsuite.cmake,64_only
```
Which resulted in:
```
  test 1
      Start 1: bogus

  1: Test command: /usr/bin/bash "-c" "exit $((RANDOM % 4))"
  1: Working Directory: /usr/local/google/home/bruening/dr/git/build_suite/build_debug-internal-64
  1: Test timeout computed to be: 600
  1/1 Test #1: bogus ............................***Failed    0.00 sec
      Start 1: bogus

  1: Test command: /usr/bin/bash "-c" "exit $((RANDOM % 4))"
  1: Working Directory: /usr/local/google/home/bruening/dr/git/build_suite/build_debug-internal-64
  1: Test timeout computed to be: 600
      Test #1: bogus ............................   Passed    0.00 sec

  100% tests passed, 0 tests failed out of 1
```

Issue: #2204, #5873
Fixes #2204
derekbruening added a commit that referenced this issue Aug 8, 2023
Switches scheduler_launcher from using the tid to distinguish
inputs to using the input ordinal.  Tid values can be duplicated so they
should not be used as unique identifiers across workloads.

Tested: No automated test currently relies on the launcher; it is
there for experimentation and as an example for how to use the
scheduler, so we want it to use the recommended techniques.  I ran it
on the threadsig app and confirmed record and replay are using
ordinals:
  ===========================================================================
  $ rm -rf drmemtrace.*.dir; bin64/drrun -stderr_mask 12 -t drcachesim -offline -- ~/dr/test/threadsig 16 2000 && bin64/drrun -t drcachesim -simulator_type basic_counts -indir drmemtrace.*.dir > COUNTS 2>&1 && clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -record_file record.zip > RECORD 2>&1 && clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -replay_file record.zip > REPLAY 2>&1 && tail -n 4 RECORD REPLAY
  Estimation of pi is 3.141592674423126
  Received 89 alarms
  ==> RECORD <==
  Core #0: 16 15 16 15 16 0 15 16 15 8 16 6 5 7
  Core #1: 9 3 12 16 11 16 8 0 16 0 16 1 16
  Core #2: 3 14 16 14 16 0 15 16 8 16 2 6 8 1 10
  Core #3: 13 3 13 9 11 12 16 6 16 6 16 2 4

  ==> REPLAY <==
  Core #0: 16 15 16 15 16 0 15 16 15 8 16 6 5 7
  Core #1: 9 3 12 16 11 16 8 0 16 0 16 1 16
  Core #2: 3 14 16 14 16 0 15 16 8 16 2 6 8 1 10
  Core #3: 13 3 13 9 11 12 16 6 16 6 16 2 4
  ===========================================================================

Issue: #5843
derekbruening added a commit that referenced this issue Aug 9, 2023
Switches scheduler_launcher from using the tid to distinguish inputs
to using the input ordinal. Tid values can be duplicated so they should not be
used as unique identifiers across workloads.

Tested: No automated test currently relies on the launcher; it is there
for experimentation and as an example for how to use the scheduler, so
we want it to use the recommended techniques. I ran it on the threadsig
app and confirmed record and replay are using ordinals:
```
$ rm -rf drmemtrace.*.dir; bin64/drrun -stderr_mask 12 -t drcachesim -offline -- ~/dr/test/threadsig 16 2000 && bin64/drrun -t drcachesim -simulator_type basic_counts -indir drmemtrace.*.dir > COUNTS 2>&1 && clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -record_file record.zip > RECORD 2>&1 && clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -replay_file record.zip > REPLAY 2>&1 && tail -n 4 RECORD REPLAY
Estimation of pi is 3.141592674423126
Received 89 alarms
==> RECORD <==
Core #0: 16 15 16 15 16 0 15 16 15 8 16 6 5 7
Core #1: 9 3 12 16 11 16 8 0 16 0 16 1 16
Core #2: 3 14 16 14 16 0 15 16 8 16 2 6 8 1 10
Core #3: 13 3 13 9 11 12 16 6 16 6 16 2 4

==> REPLAY <==
Core #0: 16 15 16 15 16 0 15 16 15 8 16 6 5 7
Core #1: 9 3 12 16 11 16 8 0 16 0 16 1 16
Core #2: 3 14 16 14 16 0 15 16 8 16 2 6 8 1 10
Core #3: 13 3 13 9 11 12 16 6 16 6 16 2 4
```

Issue: #5843
derekbruening added a commit that referenced this issue Aug 11, 2023
Removes the original (but never implemented) report_time() heartbeat
design in favor of the simulator passing the current time to a new
version of next_record().

Implements QUANTUM_TIME by recording the start time of each input when
it is first scheduled and comparing to the new time in next_record().
Switches are only done at instruction boundaries for simplicity of
interactions with record-replay and skipping.
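
As a rough sketch of the simulator side of this change (assuming the drmemtrace scheduler headers and namespace; the helper below, the loop structure, and the status handling are illustrative, with only the time-taking next_record() and QUANTUM_TIME coming from this change):
```
#include <chrono>
#include <cstdint>

#include "scheduler.h" /* drmemtrace scheduler (assumed include path). */

using dynamorio::drmemtrace::memref_t;
using dynamorio::drmemtrace::scheduler_t;

/* Hypothetical helper: microsecond wall-clock time for QUANTUM_TIME. */
static uint64_t
current_time_us()
{
    return std::chrono::duration_cast<std::chrono::microseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}

/* Per-core loop passing the current time into the new next_record()
 * overload so time-based quantum switches can trigger. */
static void
run_core(scheduler_t::stream_t *stream)
{
    memref_t record;
    for (;;) {
        scheduler_t::stream_status_t status =
            stream->next_record(record, current_time_us());
        if (status == scheduler_t::STATUS_EOF)
            break;
        if (status != scheduler_t::STATUS_OK)
            continue; /* Waiting or error handling: elided in this sketch. */
        /* ... feed the record to the simulator ... */
    }
}
```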

Adds 2 unit tests.

Adds time support with wall-clock time to the scheduler_launcher.
This was tested manually on some sample traces.  For threadsig traces,
with DEPENDENCY_TIMESTAMPS, the quantum size doesn't make a huge difference
as the timestamp ordering imposes significant constraints.  I added an
option to ignore the timestamps ("-no_honor_stamps") and there we
really see the effects of smaller quanta with more context switches.

===========================================================================
With timestamp deps and a 2ms quantum (compare to 2ms w/o deps below):
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -sched_time -verbose 1 -honor_stamps
Core #0: 15 12 1 15 1 15 7 12 15 7 6 9 5
Core #1: 13 10 11 15 12 10 15 12 10 15 10 11 10 15 10 8 2
Core #2: 16 11 15 10 11 15 11 15 11 15 4 7 12 4 0 14
Core #3: 3 1 15 12 10 15 12 10 1 12 15 7 15 4 15

===========================================================================
Without, but a long quantum of 20ms:
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 20000 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 5 8 12 16 4 11 15
Core #1: 1 4 9 14 0 7 9 0
Core #2: 2 6 10 13 1 6 13 1
Core #3: 3 7 11 15 2 3 8 10 14 16

===========================================================================
Without, but a smaller quantum of 2ms:
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 5 9 13 1 7 9 13 0 4 8 11 15 3 7 9 13 3 5 10 13 1 7 6 12 14 7 8 11 0 7 8 10 16 3 4 9 15 14 2 6 11 0 1 5 10 16 7 8 12 13 3 8 6 15 0 9 11 13
Core #1: 1 4 8 12 16 2 6 11 15 1 5 10 14 2 8 11 16 1 7 9 15 0 4 9 15 0 2 6 12 16 3 5 12 13 1 5 10 16 7 8 12 13 3 4 9 15 0 1 5 10 16 7 2 9 13 1 15
Core #2: 2 7 10 14 0 4 8 12 16 2 6 12 16 1 5 10 15 0 4 6 12 14 2 8 11 16 3 5 10 13 1 4 9 15 14 2 6 11 0 1 5 10 16 7 8 12 13 3 4 9 11 14 4 10 11 14 4 16 0
Core #3: 3 6 11 15 3 5 10 14 3 7 9 13 0 4 6 12 14 2 8 11 16 3 5 10 13 1 4 9 15 14 2 6 11 0 7 8 12 13 3 4 9 15 14 2 6 11 14 2 6 15 0 1 5 12 16 2 12 1

===========================================================================
Without, but a tiny quantum of 200us:
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 200 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 4 7 11 15 2 6 10 14 1 7 9 12 16 5 10 12 4 8 3 11 12 1 8 11 16 7 8 15 12 0 6 12 7 13 10 2 8 15 16 2 3 15 6 11 7 13 6 10 1 8 5 10 1 3 6 14 11 7 15 2 4 12 13 5 9 10 15 6 9 10 7 6 2 1 11 3 14 16 12 13 5 1 11 6 8 2 10 0 7 5 10 0 3 14 1 15 6 7 5 4 0 12 14 9 10 16 14 8 10 11 13 7 9 0 13 3 9 1 13 3 5 2 16 14 4 15 0 6 4 15 0 13 12 8 10 1 3 4 15 2 14 3 5 11 16 13 6 15 10 2 12 6 4 9 12 6 15 10 1 7 8 11 2 14 13 5 10 1 12 8 5 0 14 8 3 4 16 6 15 11 2 1 8 3 4 0 14 7 4 2 6 14 7 11 10 1 9 13 2 14 12 3 5 10 0 9 13 11 8 12 7 16 10 0 3 5 10 0 3 13 11 15 12 13 11 8 3 13 15 9 12 7 2 8 0 4 6 15 9 3 6 15 14 12 5 15 8 0 3 6 2 0 1 6 13 0 3 6 2 11 9 4 2 0 10 4 2 0 10 4 13 0 10 4 13 0 1 15 2 12 1 15 2 0 11 15 6 13 1 15 4 16 14 11 4 16 14 10 4 16 14 11 5 2 13 9 3 4 6 1 11 7 2 16 15 12 4 5 1 12 10 6 13 9 4 2 5 15 3 10 5 15 12 11 16 1 14 7 2 13 9 4 10 6 8 14 3 2 15 7 4 0 6 8 4 0 6 7 12 3 16 6 1 4 16 15 9 12 3 5 13 8 11 0 6 1 4 16 2 7 14 10 5 13 12 3 0 9 8 11 10 6 1 14 16 2 7 11 0 9 1 4 3 5 13 12 3 5 13 4 11 10 6 8 15 11 5 8 4 11 5 13 15 3 6 8 15 11 16 2 7 12 3 5 8 4 14 16 2 15 12 0 5 13 4 14 3 2 4 14 3 8 10 1 16 6 13 4 7 5 13 10 16 11 9 2 12 3 5 6 10 1 7 0 15 12 14 8 2 10 3 11 9 15 16 7 13 2 12 3 13 2 0 16 14 5 6 16 14 5 15 4 1 11 9 6 10 14 2 0 16 14 9 6 12 14 8 4 10 9 0 13 1 12 8 15
Core #1: 1 5 ... <omitted rest for space but all are as long as Core #0>
===========================================================================

Issue: #5843
derekbruening added a commit that referenced this issue Aug 15, 2023
Removes the original (but never implemented) report_time() heartbeat
design in favor of the simulator passing the current time to a new
version of next_record().

Implements QUANTUM_TIME by recording the start time of each input when
it is first scheduled and comparing to the new time in next_record().
Switches are only done at instruction boundaries for simplicity of
interactions with record-replay and skipping.

Adds 2 unit tests.

Adds time support with wall-clock time to the scheduler_launcher. This
was tested manually on some sample traces. For threadsig traces, with
DEPENDENCY_TIMESTAMPS, the quantum size doesn't make a huge difference as the
timestamp ordering imposes significant constraints. I added an option to
ignore the timestamps ("-no_honor_stamps") and there we really see the
effects of smaller quanta with more context switches.

With timestamp deps and a 2ms quantum (compare to 2ms w/o deps below):
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -sched_time -verbose 1 -honor_stamps
Core #0: 15 12 1 15 1 15 7 12 15 7 6 9 5
Core #1: 13 10 11 15 12 10 15 12 10 15 10 11 10 15 10 8 2
Core #2: 16 11 15 10 11 15 11 15 11 15 4 7 12 4 0 14
Core #3: 3 1 15 12 10 15 12 10 1 12 15 7 15 4 15
```
Without, but a long quantum of 20ms:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 20000 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 5 8 12 16 4 11 15
Core #1: 1 4 9 14 0 7 9 0
Core #2: 2 6 10 13 1 6 13 1
Core #3: 3 7 11 15 2 3 8 10 14 16
```
Without, but a smaller quantum of 2ms:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 2000 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 5 9 13 1 7 9 13 0 4 8 11 15 3 7 9 13 3 5 10 13 1 7 6 12 14 7 8 11 0 7 8 10 16 3 4 9 15 14 2 6 11 0 1 5 10 16 7 8 12 13 3 8 6 15 0 9 11 13
Core #1: 1 4 8 12 16 2 6 11 15 1 5 10 14 2 8 11 16 1 7 9 15 0 4 9 15 0 2 6 12 16 3 5 12 13 1 5 10 16 7 8 12 13 3 4 9 15 0 1 5 10 16 7 2 9 13 1 15
Core #2: 2 7 10 14 0 4 8 12 16 2 6 12 16 1 5 10 15 0 4 6 12 14 2 8 11 16 3 5 10 13 1 4 9 15 14 2 6 11 0 1 5 10 16 7 8 12 13 3 4 9 11 14 4 10 11 14 4 16 0
Core #3: 3 6 11 15 3 5 10 14 3 7 9 13 0 4 6 12 14 2 8 11 16 3 5 10 13 1 4 9 15 14 2 6 11 0 7 8 12 13 3 4 9 15 14 2 6 11 14 2 6 15 0 1 5 12 16 2 12 1
```
Without, but a tiny quantum of 200us:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 200 -sched_time -verbose 1 -no_honor_stamps
Core #0: 0 4 7 11 15 2 6 10 14 1 7 9 12 16 5 10 12 4 8 3 11 12 1 8 11 16 7 8 15 12 0 6 12 7 13 10 2 8 15 16 2 3 15 6 11 7 13 6 10 1 8 5 10 1 3 6 14 11 7 15 2 4 12 13 5 9 10 15 6 9 10 7 6 2 1 11 3 14 16 12 13 5 1 11 6 8 2 10 0 7 5 10 0 3 14 1 15 6 7 5 4 0 12 14 9 10 16 14 8 10 11 13 7 9 0 13 3 9 1 13 3 5 2 16 14 4 15 0 6 4 15 0 13 12 8 10 1 3 4 15 2 14 3 5 11 16 13 6 15 10 2 12 6 4 9 12 6 15 10 1 7 8 11 2 14 13 5 10 1 12 8 5 0 14 8 3 4 16 6 15 11 2 1 8 3 4 0 14 7 4 2 6 14 7 11 10 1 9 13 2 14 12 3 5 10 0 9 13 11 8 12 7 16 10 0 3 5 10 0 3 13 11 15 12 13 11 8 3 13 15 9 12 7 2 8 0 4 6 15 9 3 6 15 14 12 5 15 8 0 3 6 2 0 1 6 13 0 3 6 2 11 9 4 2 0 10 4 2 0 10 4 13 0 10 4 13 0 1 15 2 12 1 15 2 0 11 15 6 13 1 15 4 16 14 11 4 16 14 10 4 16 14 11 5 2 13 9 3 4 6 1 11 7 2 16 15 12 4 5 1 12 10 6 13 9 4 2 5 15 3 10 5 15 12 11 16 1 14 7 2 13 9 4 10 6 8 14 3 2 15 7 4 0 6 8 4 0 6 7 12 3 16 6 1 4 16 15 9 12 3 5 13 8 11 0 6 1 4 16 2 7 14 10 5 13 12 3 0 9 8 11 10 6 1 14 16 2 7 11 0 9 1 4 3 5 13 12 3 5 13 4 11 10 6 8 15 11 5 8 4 11 5 13 15 3 6 8 15 11 16 2 7 12 3 5 8 4 14 16 2 15 12 0 5 13 4 14 3 2 4 14 3 8 10 1 16 6 13 4 7 5 13 10 16 11 9 2 12 3 5 6 10 1 7 0 15 12 14 8 2 10 3 11 9 15 16 7 13 2 12 3 13 2 0 16 14 5 6 16 14 5 15 4 1 11 9 6 10 14 2 0 16 14 9 6 12 14 8 4 10 9 0 13 1 12 8 15
Core #1: 1 5 ... <omitted rest for space but all are as long as Core #0>
```

Issue: #5843
derekbruening added a commit that referenced this issue Aug 18, 2023
Adds printing '.' for every record and '-' for waiting to the
scheduler unit tests and updates all the expected output.  This makes
it much easier to understand some of the results as now the lockstep
timing all lines up.

Adds -print_every to the launcher and switches to printing letters for
a better output of what happened on each core (if #inputs<=26).
Example:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 60000 -print_every 5000
Core #0: GGGGGGGGG,HH,F,B,G,I,A,CC,G,BB,A,FF,AA,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
Core #1: D,C,D,B,H,FF,EE,CC,II,AA,C,D,G,HH,D,G,II,G,I,G,I,G,HH,BB,II,BB,C,H,I,AA,C,F,I,H,II,AA,C,H,A,H,F,CC,DD,C,BB,HH,CC,F,BB,C,D,H,BB,D,B,EE,I,E,DD,B,F,H,A,D,C,D,E,B,D,I,D,AA,E,DD,EE,CC,II,C,D,I,AA,DD,B,E,I,D,C,E,FF,E,BB,EE,FF,E,AA,D,E,DD,H,BB,HH,D,H,BB,I,AA,II,H,A,FF,H,I,HH,DD,I,H,F,DD,I,A,HH,AA,CC,BB,CC,BB,D,B,FF,H,F,D,I,DD,FF,C,A,C,AA,F,AA,EE,A,D,E,FF,AA,F,A,E,A,E,DD,EE,F,E,F,A
Core #2: F,E,F,C,F,H,I,B,HH,II,FF,CC,G,H,DD,E,A,G,H,G,DD,G,F,D,A,H,I,FF,H,C,A,CC,II,A,FF,C,I,F,CC,B,FF,C,B,H,CC,B,D,B,DD,B,F,I,F,II,D,A,DD,I,D,H,E,H,I,D,HH,FF,BB,II,AA,EE,B,A,BB,E,II,A,BB,A,HH,E,AA,E,F,A,DD,HH,F,H,A,E,I,FF,I,B,F,II,A,FF,D,H,DD,I,AA,F,D,FF,AA,D,A,HH,A,H,F,A,FF,C,F,B,F,C,F,AA,B,FF,D,F,DD,B,C,H,CC,B,C,E,D,EE,C,E,D,EE,F,DD,E,F,D,A,DD,E,D,EE,D,E,D,AA,D,A,DD,F,D,C,D
Core #3: E,A,F,A,D,I,DD,BB,AA,BB,DD,G,EE,AA,H,G,D,B,G,B,G,II,F,HH,B,AA,I,B,A,HH,CC,HH,F,A,FF,C,HH,BB,F,D,F,C,FF,H,C,FF,DD,AA,I,B,II,AA,I,A,B,A,F,A,C,I,B,H,A,F,C,A,C,EE,F,D,EE,CC,E,BB,E,DD,E,CC,B,EE,C,EE,B,I,E,D,E,II,H,B,EE,I,EE,B,II,F,EE,A,D,AA,DD,HH,F,A,F,HH,D,A,II,H,F,II,FF,CC,B,AA,F,A,C,FF,D,C,D,CC,B,C,DD,H,I,F,CC,A,F,C,FF,E,A,DD,E,D,A,FF,AA,EE,F,DD,FF,E,F,EE,FF,AA,EEEEEEEEEEEEEEEE
```

Issue: #5843
derekbruening added a commit that referenced this issue Aug 21, 2023
Adds printing '.' for every record and '-' for waiting to the scheduler
unit tests and updates all the expected output. This makes it much
easier to understand some of the results as now the lockstep timing all
lines up.

Adds -print_every to the launcher and switches to printing letters for a
better output of what happened on each core (if #inputs<=26). Example:
```
$ clients/bin64/scheduler_launcher -trace_dir drmemtrace.*.dir/trace -num_cores 4 -sched_quantum 60000 -print_every 5000
Core #0: GGGGGGGGG,HH,F,B,G,I,A,CC,G,BB,A,FF,AA,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
Core #1: D,C,D,B,H,FF,EE,CC,II,AA,C,D,G,HH,D,G,II,G,I,G,I,G,HH,BB,II,BB,C,H,I,AA,C,F,I,H,II,AA,C,H,A,H,F,CC,DD,C,BB,HH,CC,F,BB,C,D,H,BB,D,B,EE,I,E,DD,B,F,H,A,D,C,D,E,B,D,I,D,AA,E,DD,EE,CC,II,C,D,I,AA,DD,B,E,I,D,C,E,FF,E,BB,EE,FF,E,AA,D,E,DD,H,BB,HH,D,H,BB,I,AA,II,H,A,FF,H,I,HH,DD,I,H,F,DD,I,A,HH,AA,CC,BB,CC,BB,D,B,FF,H,F,D,I,DD,FF,C,A,C,AA,F,AA,EE,A,D,E,FF,AA,F,A,E,A,E,DD,EE,F,E,F,A
Core #2: F,E,F,C,F,H,I,B,HH,II,FF,CC,G,H,DD,E,A,G,H,G,DD,G,F,D,A,H,I,FF,H,C,A,CC,II,A,FF,C,I,F,CC,B,FF,C,B,H,CC,B,D,B,DD,B,F,I,F,II,D,A,DD,I,D,H,E,H,I,D,HH,FF,BB,II,AA,EE,B,A,BB,E,II,A,BB,A,HH,E,AA,E,F,A,DD,HH,F,H,A,E,I,FF,I,B,F,II,A,FF,D,H,DD,I,AA,F,D,FF,AA,D,A,HH,A,H,F,A,FF,C,F,B,F,C,F,AA,B,FF,D,F,DD,B,C,H,CC,B,C,E,D,EE,C,E,D,EE,F,DD,E,F,D,A,DD,E,D,EE,D,E,D,AA,D,A,DD,F,D,C,D
Core #3: E,A,F,A,D,I,DD,BB,AA,BB,DD,G,EE,AA,H,G,D,B,G,B,G,II,F,HH,B,AA,I,B,A,HH,CC,HH,F,A,FF,C,HH,BB,F,D,F,C,FF,H,C,FF,DD,AA,I,B,II,AA,I,A,B,A,F,A,C,I,B,H,A,F,C,A,C,EE,F,D,EE,CC,E,BB,E,DD,E,CC,B,EE,C,EE,B,I,E,D,E,II,H,B,EE,I,EE,B,II,F,EE,A,D,AA,DD,HH,F,A,F,HH,D,A,II,H,F,II,FF,CC,B,AA,F,A,C,FF,D,C,D,CC,B,C,DD,H,I,F,CC,A,F,C,FF,E,A,DD,E,D,A,FF,AA,EE,F,DD,FF,E,F,EE,FF,AA,EEEEEEEEEEEEEEEE
```

Issue: #5843
derekbruening added a commit that referenced this issue Feb 7, 2024
Adds a new scheduler option randomize_next_input and corresponding
launcher option -sched_randomize.  When enabled, priorities,
timestamps, and FIFO ordering are ignored and instead a random next
element from the ready queue is selected.  This will be useful for
schedule sensitivity studies.
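
For reference, a sketch of enabling this programmatically rather than via the launcher flag (only randomize_next_input comes from this change; the include path, namespace, and constructor arguments follow the existing scheduler_options_t usage and are assumptions here):
```
#include "scheduler.h" /* drmemtrace scheduler (assumed include path). */

using dynamorio::drmemtrace::scheduler_t;

/* Hypothetical helper building options that use random input selection. */
static scheduler_t::scheduler_options_t
make_randomized_options()
{
    scheduler_t::scheduler_options_t ops(scheduler_t::MAP_TO_ANY_OUTPUT,
                                         scheduler_t::DEPENDENCY_IGNORE,
                                         scheduler_t::SCHEDULER_DEFAULTS);
    /* Ignore priorities, timestamps, and FIFO order: pick a random ready
     * input at each switch (useful for schedule sensitivity studies). */
    ops.randomize_next_input = true;
    return ops;
}
```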

Adds a unit test.

Tested manually end-to-end as well:
```
  $ for ((i=0; i<10; ++i)); do bin64/drrun -t drcachesim -simulator_type schedule_stats -core_sharded -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -cores 2 -sched_randomize 2>&1 | grep schedule; done
  Core #0 schedule: HEFBF_
  Core #1 schedule: GCDFA
  Core #0 schedule: GDBH
  Core #1 schedule: CFAFEF__
  Core #0 schedule: GDBF__
  Core #1 schedule: EFCFAH
  Core #0 schedule: BHFDF__
  Core #1 schedule: EGFAC
  Core #0 schedule: HAFEF__
  Core #1 schedule: GDCFB
  Core #0 schedule: ABFGF__
  Core #1 schedule: CEDFH
  Core #0 schedule: HDBF_F_F__
  Core #1 schedule: ECGA
  Core #0 schedule: HDEFA
  Core #1 schedule: CBFGF__
  Core #0 schedule: FHGFCF_
  Core #1 schedule: DEAB
  Core #0 schedule: EFABF_
  Core #1 schedule: GCDFH
```

Fixes #6636
derekbruening added a commit that referenced this issue Feb 8, 2024
Adds a new scheduler option randomize_next_input and corresponding
launcher option -sched_randomize. When enabled, priorities,
timestamps, and FIFO ordering are ignored and instead a random next
element from the ready queue is selected. This will be useful for
schedule sensitivity studies.

Adds a unit test.

Tested manually end-to-end as well:
```
  $ for ((i=0; i<10; ++i)); do bin64/drrun -t drcachesim -simulator_type schedule_stats -core_sharded -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -cores 2 -sched_randomize 2>&1 | grep schedule; done
  Core #0 schedule: HEFBF_
  Core #1 schedule: GCDFA
  Core #0 schedule: GDBH
  Core #1 schedule: CFAFEF__
  Core #0 schedule: GDBF__
  Core #1 schedule: EFCFAH
  Core #0 schedule: BHFDF__
  Core #1 schedule: EGFAC
  Core #0 schedule: HAFEF__
  Core #1 schedule: GDCFB
  Core #0 schedule: ABFGF__
  Core #1 schedule: CEDFH
  Core #0 schedule: HDBF_F_F__
  Core #1 schedule: ECGA
  Core #0 schedule: HDEFA
  Core #1 schedule: CBFGF__
  Core #0 schedule: FHGFCF_
  Core #1 schedule: DEAB
  Core #0 schedule: EFABF_
  Core #1 schedule: GCDFH
```

Fixes #6636
derekbruening added a commit that referenced this issue Sep 13, 2024
Removes the global runqueue and global sched_lock_, replacing with
per-output runqueues which each have a lock inside a new struct
input_queue_t which clearly delineates what the lock protects.  The
unscheduled queue remains global and has its own lock as another
input_queue_t.  The output fields .active and .cur_time are now
atomics, as they are accessed from other outputs yet are separate from
the queue and its mutex.

Makes the runqueue lock usage narrow, avoiding holding locks across
the larger functions.  Establishes a lock ordering convention: input >
output > unsched.

The removal of the global sched_lock_ avoids the lock contention seen
on fast analyzers (the original design targeted heavyweight
simulators).  On a large internal trace with hundreds of threads on
>100 cores we were seeing 41% of lock attempts collide with
the global queue:
```
    [scheduler] Schedule lock acquired     :  72674364
    [scheduler] Schedule lock contended    :  30144911
```
With separate runqueues we see < 1 in 10,000 collide:
```
    [scheduler] Stats for output #0
    <...>
    [scheduler]   Runqueue lock acquired             :  34594996
    [scheduler]   Runqueue lock contended            :        29
    [scheduler] Stats for output #1
    <...>
    [scheduler]   Runqueue lock acquired             :  51130763
    [scheduler]   Runqueue lock contended            :        41
    <...>
    [scheduler]   Runqueue lock acquired             :  46305755
    [scheduler]   Runqueue lock contended            :        44
    [scheduler] Unscheduled queue lock acquired      :     27834
    [scheduler] Unscheduled queue lock contended     :       273
    $ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
    11528
    $ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
    6814820713
    (gdb) p 11528/6814820713.*100
    $1 = 0.00016916072315753086
```

Before an output goes idle, it attempts to steal work from another
output's runqueue.  A new input option is added controlling the
migration threshold to avoid moving jobs too frequently.  The stealing
is done inside eof_or_idle() which now returns a new internal status
code STATUS_STOLE so the various callers can be sure to read the next
record.

Adds a periodic rebalancing with a period equal to another new input
option.  Adds flexible_queue_t::back() for rebalancing to not take from
the front of the queues.

Updates an output going inactive and promoting everything-unscheduled
to use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs
during stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different
resulting schedules.

Adds a new unit test targeting queue rebalancing.

Issue: #6938
derekbruening added a commit that referenced this issue Sep 17, 2024
Removes the global runqueue and global sched_lock_, replacing with
per-output runqueues which each have a lock inside a new struct
input_queue_t which clearly delineates what the lock protects. The
unscheduled queue remains global and has its own lock as another
input_queue_t. The output fields .active and .cur_time are now atomics,
as they are accessed from other outputs yet are separate from the queue
and its mutex.

Makes the runqueue lock usage narrow, avoiding holding locks across
the larger functions.  Establishes a lock ordering convention: input >
output > unsched.

The removal of the global sched_lock_ avoids the lock contention seen on
fast analyzers (the original design targeted heavyweight simulators). On
a large internal trace with hundreds of threads on >100
cores we were seeing 41% of lock attempts collide with
the global queue:
```
    [scheduler] Schedule lock acquired     :  72674364
    [scheduler] Schedule lock contended    :  30144911
```
With separate runqueues we see < 1 in 10,000 collide:
```
    [scheduler] Stats for output #0
    <...>
    [scheduler]   Runqueue lock acquired             :  34594996
    [scheduler]   Runqueue lock contended            :        29
    [scheduler] Stats for output #1
    <...>
    [scheduler]   Runqueue lock acquired             :  51130763
    [scheduler]   Runqueue lock contended            :        41
    <...>
    [scheduler]   Runqueue lock acquired             :  46305755
    [scheduler]   Runqueue lock contended            :        44
    [scheduler] Unscheduled queue lock acquired      :     27834
    [scheduler] Unscheduled queue lock contended     :       273
    $ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
    11528
    $ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
    6814820713
    (gdb) p 11528/6814820713.*100
    $1 = 0.00016916072315753086
```

Before an output goes idle, it attempts to steal work from another
output's runqueue. A new input option is added controlling the migration
threshold to avoid moving jobs too frequently. The stealing is done
inside eof_or_idle() which now returns a new internal status code
STATUS_STOLE so the various callers can be sure to read the next record.

Adds a periodic rebalancing with a period equal to another new input
option. Adds flexible_queue_t::back() for rebalancing to not take from
the front of the queues.

Updates an output going inactive and promoting everything-unscheduled to
use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs during
stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different
resulting schedules.

Adds a new unit test targeting queue rebalancing.

Tested under ThreadSanitizer for race detection on a relatively large
trace on 90 cores.

Issue: #6938
derekbruening added a commit that referenced this issue Oct 2, 2024
Adds a new scheduler feature and CLI option exit_if_fraction_left.
This applies to -core_sharded and -core_serial modes.  When an input
reaches EOF, if the number of non-EOF inputs left as a fraction of the
original inputs is equal to or less than this value then the scheduler
exits (sets all outputs to EOF) rather than finishing off the final
inputs.  This helps avoid long sequences of idles during staggered
endings with fewer inputs left than cores and only a small fraction of
the total instructions left in those inputs.

The default value in scheduler_options_t is 0 as simulators are
typically already choosing to stop at some even point.  For analyzers,
however, via the command-line option, the default is 0.05 (i.e., 5%),
which when tested on a large internal trace helps eliminate much of
the final idle time from the cores (just about any value over 0.05
works well: it is not overly sensitive).

Compare the numbers below for today's default with a long idle time
and so distinct differences between the "cpu busy by time" and "cpu
busy by time, ignoring idle past last instr" stats on a 39-core
schedule-stats run of a moderately large trace, with key stats and the
1st 2 cores (for brevity) shown here:

  1567052521 instructions
   878027975 idles
       64.09% cpu busy by record count
       82.38% cpu busy by time
       96.81% cpu busy by time, ignoring idle past last instr
Core #0 schedule: CccccccOXHhUuuuuAaSEOGOWEWQqqqFffIiTETENWwwOWEeeeeeeACMmTQFfOWLWVvvvvFQqqqqYOWOooOWOYOYQOWO_O_W_O_W_O_W_O_WO_WO_O_O_O_O_O_OR_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_RY_YyyyySUuuOSISO_S_S_SOPpSOKO_KO_KCcDKWDB_B_____________________________________________
Core #1 schedule: KkLWSFUQPDddddddddXxSUSVRJWKkRNJBWUWwwTttGgRNKkkRWNTtFRWKkRNWUuuGULRFSRSYKkkkRYAYFffGSRYHRYHNWMDddddddddRYGgggggYHNWK_YAHYNnGYSNHWwwwwSWSNKSYyyWKNNWKNNGAKWGggNnNW_NNWE_E_EF__________________________________________________

And now with -exit_if_fraction_left 0.05, where we lose (1567052521 -
1564522227)/1567052521. = 0.16% of the instructions but drastically
reduce the tail from 14% of the time to less than 1% of the time:

  1564522227 instructions
   120512812 idles
       92.85% cpu busy by record count
       96.39% cpu busy by time
       97.46% cpu busy by time, ignoring idle past last instr
766.85user 6.33system 1:15.88elapsed 1018%CPU (0avgtext+0avgdata 4947364maxresident)k
Core #0 schedule: CccccccOXHKYEGGETRARrrPRTVvvvRrrNWwwOOKWVRRrPBbbXUVvvvvvOWKVLWVvvJjSOWKVUuTIiiiFPpppKAaaMFfffAHOKWAaGNBOWKAPPOABCWKPWOKWPCXxxxZOWKCccJSOSWKJUYRCOWKCcSOSUKkkkOROK_O_O_O_O_O
Core #1 schedule: KkLWSMmmFLSFffffffJjWBbGBUuuuuuuuuuuBDBJJRJWKkRNJWMBKkkRNWKkRNWKkkkRNWXxxxxxZOooAaUIiTHhhhSDNnnnHZzQNnnRNWXxxxxxRNWUuuRNWKXUuXRNKRWKNXxxRWKONNHRKWONURKWXRKXRKNW_KR_KkRK_KRKR_R_R_R_R_R_R_R_R_R_R_R__R__R__R___R___R___R___R___R

Fixes #6959
derekbruening added a commit that referenced this issue Oct 4, 2024
Adds a new scheduler feature and CLI option exit_if_fraction_inputs_left. This
applies to -core_sharded and -core_serial modes. When an input reaches
EOF, if the number of non-EOF inputs left as a fraction of the original
inputs is equal to or less than this value then the scheduler exits
(sets all outputs to EOF) rather than finishing off the final inputs.
This helps avoid long sequences of idles during staggered endings with
fewer inputs left than cores and only a small fraction of the total
instructions left in those inputs.

The default value in scheduler_options_t and the CLI option is 0.05 (i.e., 5%),
which when tested on a large internal trace helps eliminate much of the
final idle time from the cores without losing many instructions.

Compare the numbers below for today's default with a long idle time and
so distinct differences between the "cpu busy by time" and "cpu busy by
time, ignoring idle past last instr" stats on a 39-core schedule-stats
run of a moderately large trace, with key stats and the 1st 2 cores (for
brevity) shown here:

```
  1567052521 instructions
   878027975 idles
       64.09% cpu busy by record count
       82.38% cpu busy by time
       96.81% cpu busy by time, ignoring idle past last instr
Core #0 schedule: CccccccOXHhUuuuuAaSEOGOWEWQqqqFffIiTETENWwwOWEeeeeeeACMmTQFfOWLWVvvvvFQqqqqYOWOooOWOYOYQOWO_O_W_O_W_O_W_O_WO_WO_O_O_O_O_O_OR_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_R_RY_YyyyySUuuOSISO_S_S_SOPpSOKO_KO_KCcDKWDB_B_____________________________________________ 
Core #1 schedule: KkLWSFUQPDddddddddXxSUSVRJWKkRNJBWUWwwTttGgRNKkkRWNTtFRWKkRNWUuuGULRFSRSYKkkkRYAYFffGSRYHRYHNWMDddddddddRYGgggggYHNWK_YAHYNnGYSNHWwwwwSWSNKSYyyWKNNWKNNGAKWGggNnNW_NNWE_E_EF__________________________________________________
```

And now with -exit_if_fraction_inputs_left 0.05, where we lose (1567052521 -
1564522227)/1567052521. = 0.16% of the instructions but drastically
reduce the tail from 14% of the time to less than 1% of the time:

```
  1564522227 instructions
   120512812 idles
       92.85% cpu busy by record count
       96.39% cpu busy by time
       97.46% cpu busy by time, ignoring idle past last instr
Core #0 schedule: CccccccOXHKYEGGETRARrrPRTVvvvRrrNWwwOOKWVRRrPBbbXUVvvvvvOWKVLWVvvJjSOWKVUuTIiiiFPpppKAaaMFfffAHOKWAaGNBOWKAPPOABCWKPWOKWPCXxxxZOWKCccJSOSWKJUYRCOWKCcSOSUKkkkOROK_O_O_O_O_O 
Core #1 schedule: KkLWSMmmFLSFffffffJjWBbGBUuuuuuuuuuuBDBJJRJWKkRNJWMBKkkRNWKkRNWKkkkRNWXxxxxxZOooAaUIiTHhhhSDNnnnHZzQNnnRNWXxxxxxRNWUuuRNWKXUuXRNKRWKNXxxRWKONNHRKWONURKWXRKXRKNW_KR_KkRK_KRKR_R_R_R_R_R_R_R_R_R_R_R__R__R__R___R___R___R___R___R
```

Fixes #6959
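
A small sketch of setting the new knob programmatically (the field name and its 0.05 default come from this change; the helper itself, and the assumed include path and namespace, are hypothetical):
```
#include "scheduler.h" /* drmemtrace scheduler (assumed include path). */

using dynamorio::drmemtrace::scheduler_t;

/* Hypothetical helper mirroring the CLI default of 0.05 (5%). */
static void
enable_early_exit(scheduler_t::scheduler_options_t &ops, double fraction = 0.05)
{
    /* Once the non-EOF inputs drop to this fraction (or less) of the
     * original inputs, all outputs report EOF instead of finishing off
     * the last few stragglers. */
    ops.exit_if_fraction_inputs_left = fraction;
}
```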
derekbruening added a commit that referenced this issue Oct 15, 2024
Adds a new interface trace_analysis_tool::preferred_shard_type() to
the drmemtrace framework to allow tools to request core-sharded
operation.

The cache simulator, TLB simulator, and schedule_stats tools override
the new interface to request core-sharded mode.
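
A sketch of what such an override might look like in a tool (the base class, enumerator, and include names here, analysis_tool_t and SHARD_BY_CORE, are assumptions based on the existing drmemtrace analysis framework; only preferred_shard_type() is from this change):
```
#include "analysis_tool.h" /* drmemtrace analysis framework (assumed path). */

using namespace dynamorio::drmemtrace;

/* Hypothetical tool that opts in to core-sharded operation; everything
 * except preferred_shard_type() is elided. */
class my_core_sharded_tool_t : public analysis_tool_t {
public:
    shard_type_t
    preferred_shard_type() override
    {
        /* Ask the framework for core-sharded (or core-serial) input. */
        return SHARD_BY_CORE;
    }
    /* ... process_memref(), print_results(), etc. ... */
};
```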

Unfortunately, it is not easy to detect core-sharded-on-disk traces in
the launcher, so the user must now pass `-no_core_sharded` when using
such traces with core-sharded-preferring tools to avoid the trace
being re-scheduled yet again.  Documentation for this is added and the
re-scheduling is turned into a fatal error since it is almost certainly
user error.

In the launcher, if all tools prefer core-sharded, and the user did
not specify -no_core_sharded, core-sharded (or core-serial) mode is
enabled, with a -verbose 1+ message.
```
  $ bin64/drrun -stderr_mask 0 -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1 -tool schedule_stats:cache_simulator
  Enabling -core_serial as all tools prefer it
  <...>
  Schedule stats tool results:
  Total counts:
             4 cores
             8 threads: 1257600, 1257602, 1257599, 1257603, 1257598, 1257604, 1257596, 1257601
        638938 instructions
  <...>
  Core #0 schedule: AEA_A_
  Core #1 schedule: BH_
  Core #2 schedule: CG
  Core #3 schedule: DF_
  <...>
  Cache simulation results:
  Core #0 (traced CPU(s): #0)
    L1I0 (size=32768, assoc=8, block=64, LRU) stats:
      Hits:                          123,659
  <...>
```

If at least one tool prefers core-sharded but others do not, a
-verbose 1+ message suggests running with an explicit -core_sharded.
```
  $ bin64/drrun -stderr_mask 0 -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1 -tool cache_simulator:basic_counts
  Some tool(s) prefer core-sharded: consider re-running with -core_sharded or -core_serial enabled for best results.
```

Reduces the scheduler queue diagnostics by 5x as they seem too
frequent in short runs.

Updates the documentation to mention the new defaults.

Updates numerous drcachesim test output templates.

Keeps a couple of tests using thread-sharded by passing -no_core_serial.

Fixes #6949
egrimley-arm added a commit that referenced this issue Nov 26, 2024
For compatibility with newer versions of PHP.

The error (in generate_decoder.stderr) was
```
PHP Fatal error:  Uncaught TypeError: ksort(): Argument #1 ($array)
must be of type array, null given in .../sve_decode_json.php:900
```