i#6938 sched migrate: Separate run queue per output
Removes the global runqueue and global sched_lock_, replacing with
per-output runqueues which each have a lock inside a new struct
input_queue_t which clearly delineates what the lock protects.  The
unscheduled queue remains global and has its own lock as another
input_queue_t.  The output fields .active and .cur_time are now
atomics, as they are accessed from other outputs yet are separate from
the queue and its mutex.

Makes the runqueue lock usage narrow, avoiding holding locks across
the larger functions.  Establishes a lock ordering convention: input >
output > unsched.
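
As a rough illustration of the locking scheme described above, the sketch below pairs each queue with its own mutex and takes locks in a fixed order. It is not the commit's code: `queue_with_lock_t`, `output_t`, and `wake_one` are hypothetical names, and reading "input > output > unsched" as "acquire an output's lock before the unscheduled queue's lock" is an assumption.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical stand-in for input_queue_t: the mutex lives next to the one
// container it protects, so the scope of the lock is obvious.
struct queue_with_lock_t {
    std::mutex lock;
    std::deque<int> queue; // Input ordinals; guarded by lock.
};

// Hypothetical per-output state: each output owns its own runqueue.
struct output_t {
    queue_with_lock_t ready;
};

// Moves one input from the global unscheduled queue to an output's runqueue.
// Assumed ordering: the output's lock is always acquired before the
// unscheduled queue's lock, so two threads can never wait on each other.
bool
wake_one(output_t &output, queue_with_lock_t &unscheduled)
{
    assert(&output.ready != &unscheduled);
    std::lock_guard<std::mutex> output_guard(output.ready.lock);
    std::lock_guard<std::mutex> unsched_guard(unscheduled.lock);
    if (unscheduled.queue.empty())
        return false;
    output.ready.queue.push_back(unscheduled.queue.front());
    unscheduled.queue.pop_front();
    return true;
}
```

Both guards are released when the function returns, which keeps the locked region limited to the queue manipulation itself.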

The removal of the global sched_lock_ avoids the lock contention seen
on fast analyzers (the original design targeted heavyweight
simulators).  On a large internal trace with hundreds of threads on
>100 cores we were seeing 41% of lock attempts collide with
the global queue:
```
    [scheduler] Schedule lock acquired     :  72674364
    [scheduler] Schedule lock contended    :  30144911
```
With separate runqueues, fewer than 1 in 10,000 attempts collide (about 0.00017% across all queues in the totals below):
```
    [scheduler] Stats for output #0
    <...>
    [scheduler]   Runqueue lock acquired             :  34594996
    [scheduler]   Runqueue lock contended            :        29
    [scheduler] Stats for output #1
    <...>
    [scheduler]   Runqueue lock acquired             :  51130763
    [scheduler]   Runqueue lock contended            :        41
    <...>
    [scheduler]   Runqueue lock acquired             :  46305755
    [scheduler]   Runqueue lock contended            :        44
    [scheduler] Unscheduled queue lock acquired      :     27834
    [scheduler] Unscheduled queue lock contended     :       273
    $ egrep 'contend' OUT | awk '{n+=$NF}END{ print n}'
    11528
    $ egrep 'acq' OUT | awk '{n+=$NF}END{ print n}'
    6814820713
    (gdb) p 11528/6814820713.*100
    $1 = 0.00016916072315753086
```

Before an output goes idle, it attempts to steal work from another
output's runqueue.  A new input option is added controlling the
migration threshold to avoid moving jobs too frequently.  The stealing
is done inside eof_or_idle() which now returns a new internal status
code STATUS_STOLE so the various callers can be sure to read the next
record.
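
The stealing step with a migration threshold can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler's implementation: `input_t`, `output_t`, and `try_steal` are hypothetical names, time is an abstract unsigned counter, and the first eligible input found is taken. The handling of STATUS_STOLE by callers is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct input_t {
    int id;
    uint64_t last_run_time; // Time at which this input last ran on a core.
};

struct output_t {
    std::mutex lock;
    std::deque<input_t> ready; // Guarded by lock.
};

// Called when outputs[self] has nothing left to run. Scans the other
// outputs and removes the first input that has been off-core for at least
// `threshold` time units. Only one victim lock is held at a time.
bool
try_steal(std::vector<output_t> &outputs, size_t self, uint64_t now,
          uint64_t threshold, input_t &stolen)
{
    assert(self < outputs.size());
    for (size_t i = 0; i < outputs.size(); ++i) {
        if (i == self)
            continue;
        std::lock_guard<std::mutex> guard(outputs[i].lock);
        auto &queue = outputs[i].ready;
        for (auto it = queue.begin(); it != queue.end(); ++it) {
            // Skip inputs that ran too recently to be worth migrating.
            if (now >= it->last_run_time && now - it->last_run_time >= threshold) {
                stolen = *it;
                queue.erase(it);
                return true;
            }
        }
    }
    return false; // Nothing eligible: the caller would go idle.
}
```

A caller that gets `true` back would switch to the stolen input and read its next record rather than reporting idle.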

Adds periodic rebalancing, with the period set by another new input
option.  Adds flexible_queue_t::back() so that rebalancing does not take
from the front of the queues.
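
One way to picture rebalancing that takes from the back of each queue is the single-threaded sketch below. It is an illustration only: `rebalance` is a hypothetical helper, plain `std::deque<int>` stands in for the scheduler's queues, locking is omitted, and the target of "no queue above the ceiling of the average" is an assumed policy rather than the commit's actual one.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Trims every queue down to the ceiling of the average length, taking the
// surplus from the back so that the entries at the front (the next to run)
// stay where they are, then hands the surplus to the shorter queues.
void
rebalance(std::vector<std::deque<int>> &queues)
{
    assert(!queues.empty());
    size_t total = 0;
    for (const auto &queue : queues)
        total += queue.size();
    const size_t target = (total + queues.size() - 1) / queues.size();
    std::vector<int> surplus;
    for (auto &queue : queues) {
        while (queue.size() > target) {
            surplus.push_back(queue.back()); // Take from the back, not the front.
            queue.pop_back();
        }
    }
    for (auto &queue : queues) {
        while (queue.size() < target && !surplus.empty()) {
            queue.push_back(surplus.back());
            surplus.pop_back();
        }
    }
    // target * queues.size() >= total, so every surplus entry found a home.
    assert(surplus.empty());
}
```

In the real scheduler each queue has its own lock and the rebalance runs on a period; this sketch only shows the redistribution itself.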

Updates the handling of an output going inactive, and of promoting
everything-unscheduled, to use the new rebalancing.

Makes output_info_t.active atomic as it is read by other outputs
during stealing and rebalancing.

Adds statistics on the stealing and rebalancing instances.

Updates all of the unit tests, many of which now have different
resulting schedules.

Adds a new unit test targeting queue rebalancing.

Issue: #6938
derekbruening committed Sep 13, 2024
1 parent a462db1 commit c9ce27c
Showing 8 changed files with 1,098 additions and 403 deletions.
3 changes: 3 additions & 0 deletions clients/drcachesim/analyzer_multi.cpp
@@ -564,6 +564,9 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::init_dynamic_schedule()
sched_ops.blocking_switch_threshold = op_sched_blocking_switch_us.get_value();
sched_ops.block_time_multiplier = op_sched_block_scale.get_value();
sched_ops.block_time_max_us = op_sched_block_max_us.get_value();
sched_ops.migration_threshold_us = op_sched_migration_threshold_us.get_value();
sched_ops.rebalance_period_us = op_sched_rebalance_period_us.get_value();
sched_ops.time_units_per_us = op_sched_time_units_per_us.get_value();
sched_ops.randomize_next_input = op_sched_randomize.get_value();
sched_ops.honor_direct_switches = !op_sched_disable_direct_switches.get_value();
#ifdef HAS_ZIP
9 changes: 9 additions & 0 deletions clients/drcachesim/common/memtrace_stream.h
@@ -98,6 +98,15 @@ class memtrace_stream_t {
* i.e., the number of input migrations to this core.
*/
SCHED_STAT_MIGRATIONS,
/**
* Counts the number of times this output's runqueue became empty and it took
* work from another output's runqueue.
*/
SCHED_STAT_RUNQUEUE_STEALS,
/**
* Counts the number of output runqueue rebalances triggered by this output.
*/
SCHED_STAT_RUNQUEUE_REBALANCES,
/** Count of statistic types. */
SCHED_STAT_TYPE_COUNT,
};
19 changes: 19 additions & 0 deletions clients/drcachesim/common/options.cpp
@@ -992,6 +992,25 @@ droption_t<bool> op_sched_disable_direct_switches(
"switch being determined by latency and the next input in the queue. The "
"TRACE_MARKER_TYPE_DIRECT_THREAD_SWITCH markers are not removed from the trace.");

droption_t<uint64_t> op_sched_migration_threshold_us(
DROPTION_SCOPE_ALL, "sched_migration_threshold_us", 500,
"Time in simulated microseconds before an input can be migrated across cores",
"The minimum time in simulated microseconds that must have elapsed since an input "
"last ran on a core before it can be migrated to another core.");

droption_t<uint64_t> op_sched_rebalance_period_us(
DROPTION_SCOPE_ALL, "sched_rebalance_period_us", 1500000,
"Period in microseconds at which core run queues are load-balanced",
"The period in simulated microseconds at which per-core run queues are re-balanced "
"to redistribute load.");

droption_t<double> op_sched_time_units_per_us(
DROPTION_SCOPE_ALL, "sched_time_units_per_us", 1000.,
"Time units per simulated microsecond",
"Time units (currently wall-clock time) per simulated microsecond. This scales all "
"of the -sched_*_us values as it converts wall-clock time into the simulated "
"microseconds measured by those options.");

// Schedule_stats options.
droption_t<uint64_t>
op_schedule_stats_print_every(DROPTION_SCOPE_ALL, "schedule_stats_print_every",
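
To make the scaling described for -sched_time_units_per_us concrete, here is a small sketch that converts a -sched_*_us value into scheduler time units. It assumes the scaling is a plain multiplication; `us_to_time_units` is a hypothetical helper, not a function from the commit.

```cpp
#include <cassert>
#include <cstdint>

// Converts a duration in simulated microseconds into time units, given the
// number of time units that make up one simulated microsecond.
uint64_t
us_to_time_units(uint64_t microseconds, double time_units_per_us)
{
    assert(time_units_per_us > 0.);
    return static_cast<uint64_t>(static_cast<double>(microseconds) *
                                 time_units_per_us);
}
```

With the defaults in this hunk (500 for the migration threshold and 1000. units per microsecond), the threshold would correspond to 500000 time units under this assumption.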
3 changes: 3 additions & 0 deletions clients/drcachesim/common/options.h
@@ -215,6 +215,9 @@ extern dynamorio::droption::droption_t<std::string> op_cpu_schedule_file;
extern dynamorio::droption::droption_t<std::string> op_sched_switch_file;
extern dynamorio::droption::droption_t<bool> op_sched_randomize;
extern dynamorio::droption::droption_t<bool> op_sched_disable_direct_switches;
extern dynamorio::droption::droption_t<uint64_t> op_sched_migration_threshold_us;
extern dynamorio::droption::droption_t<uint64_t> op_sched_rebalance_period_us;
extern dynamorio::droption::droption_t<double> op_sched_time_units_per_us;
extern dynamorio::droption::droption_t<uint64_t> op_schedule_stats_print_every;
extern dynamorio::droption::droption_t<std::string> op_syscall_template_file;
extern dynamorio::droption::droption_t<uint64_t> op_filter_stop_timestamp;
9 changes: 9 additions & 0 deletions clients/drcachesim/scheduler/flexible_queue.h
@@ -108,6 +108,15 @@ class flexible_queue_t {
return entries_[rand_gen_() % size()]; // Undefined if empty.
}

// Returns an entry from the back -- or at least not from the front; it's not
// guaranteed to be the lowest priority, just not the highest.
T
back()
{
assert(!empty());
return entries_.back();
}

bool
empty() const
{
850 changes: 625 additions & 225 deletions clients/drcachesim/scheduler/scheduler.cpp

Large diffs are not rendered by default.