
No coordinator work #794

Merged: 4 commits into mmtk:master, Apr 27, 2023
Conversation

@wks wks commented Apr 18, 2023

Remove the concept of "coordinator work packets". Now ScheduleCollection and StopMutators are both executed on ordinary workers. The only work packet executed by the coordinator is EndOfGC.

Simplified the interaction between the coordinator and the workers. The coordinator only responds to the event that "all workers have parked". Removed the workers-to-coordinator channel. WorkerMonitor now has two Condvars, one (the existing one) for notifying workers about more work available, and another for notifying the coordinator that all workers have parked.

This PR completely removes the concept of "coordinator work". The primary goal is to fix issue #792.

The WorkerGroupState is basically equivalent to the group_sleep Boolean variable.
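
A minimal sketch of that arrangement, using the group_sleep flag and the two Condvars described above. All names and details here are illustrative assumptions (resume_and_wait comes from the diff discussed below), not the exact mmtk-core code:

use std::sync::{Condvar, Mutex};

// State shared between the coordinator and the workers (illustrative).
struct MonitorState {
    parked_workers: usize, // how many workers are currently parked
    worker_count: usize,   // total number of workers
    group_sleep: bool,     // true while the worker group is sleeping
}

struct WorkerMonitor {
    state: Mutex<MonitorState>,
    work_available: Condvar, // existing Condvar: wakes workers when more work is available
    all_parked: Condvar,     // new Condvar: wakes the coordinator when all workers have parked
}

impl WorkerMonitor {
    // Called by a worker that has run out of work packets.
    fn park_until_resumed(&self) {
        let mut s = self.state.lock().unwrap();
        s.parked_workers += 1;
        if s.parked_workers == s.worker_count {
            // Last worker to park: put the group to sleep and notify the coordinator.
            s.group_sleep = true;
            self.all_parked.notify_one();
        }
        while s.group_sleep {
            s = self.work_available.wait(s).unwrap();
        }
        s.parked_workers -= 1;
    }

    // Called by the coordinator: wake the workers, then block until all of
    // them have parked again. This is the only event the coordinator reacts to.
    fn resume_and_wait(&self) {
        let mut s = self.state.lock().unwrap();
        s.group_sleep = false;
        self.work_available.notify_all();
        while !s.group_sleep {
            s = self.all_parked.wait(s).unwrap();
        }
    }
}

Under a scheme like this, the workers-to-coordinator channel is unnecessary: the all_parked Condvar is the only signal the coordinator ever waits on.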

Known issues

I should probably fix #793 in this PR too, given that I am making changes to the synchronisation mechanism, but that is a bit complicated, so I am not fixing it in this PR.

const COORDINATOR_ONLY_STW: bool in Collection is no longer useful. I'll do a multi-repo refactoring to remove it from mmtk-core, mmtk-openjdk and mmtk-ruby.

@wks wks marked this pull request as ready for review April 19, 2023 03:59
@wks wks requested review from wenyuzhao and qinsoon April 19, 2023 03:59

@qinsoon qinsoon left a comment


Just some minor issues. It simplifies the code a lot, which is great. We may need to be careful about any synchronisation cost this change may bring, and we need to run binding tests to make sure this works.

- let event = self.receiver.poll_event();
- let finished = self.process_event(event);
+ self.scheduler.worker_monitor.resume_and_wait();
+ let finished = self.on_all_workers_parked();
Member:

You can just remove on_all_workers_parked() -- it is simple enough to be included in the main loop.

Collaborator (author):

Good idea.

src/scheduler/controller.rs (resolved)
src/scheduler/worker.rs (resolved)
/// The state of the worker group. The worker group alternates between the `Sleeping` and the
/// `Working` states. Workers execute work packets in the `Working` state, but once workers have
/// entered the `Sleeping` state, they cannot continue until the coordinator explicitly transitions
/// the state back to `Working` after it has found more work for the workers to do.
Member:

Better to make it clear that as long as at least one worker is executing, the group state is Working, and when no worker is executing, the group state is Sleeping.

Collaborator (author):

The cause and effect are the other way around: the state dictates the behaviour rather than the behaviour determining the state. The Working state allows workers to execute and forbids the coordinator from opening new buckets, while the Sleeping state forbids workers from executing packets and allows the coordinator to open buckets.
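
For illustration, a doc comment stating the invariant in that direction might read as follows (a hedged sketch; the wording is mine, not the text that was merged):

/// The state of the worker group. The state dictates behaviour rather than
/// recording it:
/// - In the `Working` state, workers may execute work packets, and the
///   coordinator must not open new buckets.
/// - In the `Sleeping` state, workers must not execute work packets, and the
///   coordinator may open new buckets.
enum WorkerGroupState {
    Working,
    Sleeping,
}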


@wenyuzhao wenyuzhao left a comment


LGTM

@qinsoon qinsoon added the PR-testing Run binding tests for the pull request (deprecated: use PR-extended-testing instead) label Apr 20, 2023

@qinsoon qinsoon left a comment


LGTM


wks commented Apr 23, 2023

build1 is master and build2 is this PR. I ran DaCapo Chopin with Immix on bobcat (Core i9-12900KF Alder Lake) and Immix and GenImmix on vole (Ryzen 9 7950X Zen 4). I didn't enable profile-guided optimisation.

The results show a consistent improvement in STW time on both machines and with both Immix and GenImmix, with GenImmix improving to an even greater extent. This is not surprising: removing the so-called "coordinator work" and one communication channel should reduce synchronisation overhead.

Plotty links:

STW time (Immix, bobcat)
Total time (Immix, bobcat)
STW time (Immix, vole)
Total time (Immix, vole)
STW time (GenImmix, vole)
Total time (GenImmix, vole)


qinsoon commented Apr 23, 2023

Wow, some benchmarks showed 40% improvement in GC time by removing a mechanism that we barely used.

@k-sareen

Very cool! I think this is still with the improper min nursery size, right? We should soon investigate what a proper min nursery size would be.


wks commented Apr 24, 2023

This is still based on master, so I haven't applied any changes to the nursery size yet.

I am not sure why the improvement is so significant.

  • Maybe it just removes the overheads I introduced in the last PR (Let the coordinator thread open buckets #782).
  • It may be that it eliminated some unnecessary synchronisation, which affects short GCs (nursery GCs) more than long GCs (the full GCs that non-generational Immix always performs).
  • It may be a result of not using PGO. I am not sure whether we should use PGO for this kind of comparison, because if the two profiles are different, the difference in performance results may not be the result of the changes I made.

Anyway, it shows this change does not have a negative impact on performance, but we should investigate why this is happening using more advanced tools.


wks commented Apr 24, 2023

> Wow, some benchmarks showed 40% improvement in GC time by removing a mechanism that we barely used.

It's only 20% for GenImmix, and 10% for Immix. The zero point of the plots generated by Plotty is misleading.


wks commented Apr 24, 2023

I re-ran lusearch with three builds.

build  | with PR 782 | with this PR
build1 | yes         | no
build2 | yes         | yes
build3 | no          | no

I ran it on both bobcat.moma and vole.moma. The times are normalised to build3. Here are the results:

STW times on bobcat
STW times on vole
Total time on bobcat
Total time on vole

The results are consistent between bobcat and vole. My last PR (#782) introduced a performance regression, but this PR improves performance instead, and it is an improvement even over the original state (build3).


wks commented Apr 25, 2023

Three-way comparison results:

Some benchmarks haven't finished on bobcat yet; I'll update this later.

Note: build3 is the baseline. All numbers are normalised to build3.

build  | with PR 782 | with this PR
build1 | yes         | no
build2 | yes         | yes
build3 | no          | no

STW time (Immix, bobcat)
STW time (GenImmix, bobcat)
STW time (Immix, vole)
STW time (GenImmix, vole)

build2 has a performance improvement over build3 for almost all benchmarks, but jython shows a serious problem. The STW time of GenImmix for jython is 3x greater than before, with small variation, and it can be reproduced on both machines. I think something is horribly wrong.

@k-sareen

Why does the number of GCs change for avrora with GenImmix? This PR shouldn't change the GC triggering, right?


wks commented Apr 26, 2023

I re-ran jython on vole.moma, this time using a 117M heap instead of 1428M. The number of GCs is much greater this time.

Number of GCs (Immix, not normalised)
Number of GCs (GenImmix, not normalised)
Number of major GCs (GenImmix, not normalised)
STW time (Immix, not normalised)
STW time (GenImmix, not normalised)
Total time (Immix, not normalised)
Total time (GenImmix, not normalised)

In this run, GenImmix triggered about 272 GCs, among which there was at most 1 major GC, and the GC time ranges from 480ms (build2) to 640ms (build1), with build3 in the middle (560ms). 480ms is still about a 14% reduction from 560ms.

I don't know how many of the 272 GCs are "kind 1" or "kind 2" GCs. Assume a "kind 1" GC takes only 1ms and a "kind 2" GC takes 15ms. Then there need to be about 20 "kind 2" GCs among them for the STW time to reach 560ms: (272 - 20) * 1ms + 20 * 15ms = 552ms, which is approximately 560ms.

So from this result, I think we didn't really make jython 3x slower. It was the way I ran the experiment that gave us misleading results.


wks commented Apr 26, 2023

> Why does the number of GCs change for avrora with GenImmix? This PR shouldn't change the GC triggering, right?

I think it is just non-determinism. As we discussed in today's meeting, if one GC is triggered at a very unfortunate moment, it will impact the set of live objects, and transitively the time at which the next GC is triggered. This influence can go on and on, eventually resulting in one run triggering more GCs than another.

@k-sareen

> Why does the number of GCs change for avrora with GenImmix? This PR shouldn't change the GC triggering, right?

> I think it is just non-determinism. As we discussed in today's meeting, if one GC is triggered at a very unfortunate moment, it will impact the set of live objects, and transitively the time at which the next GC is triggered. This influence can go on and on, eventually resulting in one run triggering more GCs than another.

Could you try avrora on a tighter heap (or all benchmarks on a tighter heap for that matter) to confirm this?


wks commented Apr 27, 2023

> Why does the number of GCs change for avrora with GenImmix? This PR shouldn't change the GC triggering, right?

> I think it is just non-determinism. As we discussed in today's meeting, if one GC is triggered at a very unfortunate moment, it will impact the set of live objects, and transitively the time at which the next GC is triggered. This influence can go on and on, eventually resulting in one run triggering more GCs than another.

> Could you try avrora on a tighter heap (or all benchmarks on a tighter heap for that matter) to confirm this?

I re-ran avrora and a few other benchmarks that had low GC counts, with a smaller heap. avrora still triggered fewer than 100 GCs, but the result is slightly better. The GC counts of the other benchmarks (except avrora and batik) increased significantly. The results are consistent: we still see that build2 has a non-trivial speed-up in STW time for GenImmix over build3.

Note: jython failed with build3 with GenImmix because of SIGSEGV. I'll re-run jython with wenyuzhao/mmtk-openjdk@1c384fd cherry-picked.

Plotty link:

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2023-04-26-Wed-092433&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.stw&|10&iteration^1^4|20&1^invocation|40&Histogram%20(with%20CI)^build^benchmark&

Here is the STW time for avrora (not normalised).


wks commented Apr 27, 2023

Here is the result for jython after commit wenyuzhao/mmtk-openjdk@1c384fd is cherry-picked into the mmtk-openjdk repository of all builds.

GC is triggered about 600 times for GenImmix and about 320 times for Immix.

STW time (not normalised)
Total time (not normalised)

Plotty link:
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2023-04-27-Thu-055518&benchmark^build^invocation^iteration^mmtk_gc&GC^majorGC^time^time.stw&|10&iteration^1^4|20&1^invocation|40&Histogram%20(with%20CI)^build^mmtk_gc&

After this commit is applied, build3 no longer crashes, and build2 still improves on build3's STW time by a significant amount.


wks commented Apr 27, 2023

The performance anomaly with the jython benchmark is the result of two bugs:

  1. There is a bug in the mmtk-openjdk binding: with GenImmix, arraycopy sometimes delivers zero-sized slices to the region modbuf in the write barrier where the slices should have a positive size. This bug is fixed in wenyuzhao/mmtk-openjdk@1c384fd.
  2. There is a load-balance problem in the implementation of the ProcessRegionModBuf work packet: it always generates one single ProcessEdgesWork no matter how many slices there are in the region modbuf. In the jython benchmark, the generated ProcessEdgesWork can contain more than 4,000,000 edges. This bug is reported in Load-balance problem in ProcessRegionModBuf #797; a sketch of the kind of fix it calls for follows below.
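
For illustration, a fix along the lines #797 asks for would split the buffered edges into capacity-bounded chunks instead of one huge packet. Everything below (the function, the capacity constant, using usize as a stand-in for an edge) is an assumption made for this sketch, not mmtk-core's actual code:

// Assumed per-packet capacity; the real bound would be whatever capacity
// ProcessEdgesWork packets normally have.
const EDGES_PER_PACKET: usize = 4096;

// Split the edges collected from the region modbuf into one chunk per
// EDGES_PER_PACKET edges, so each chunk can become its own ProcessEdgesWork
// and the tracing load can be spread across workers.
fn split_into_packets(edges: Vec<usize>) -> Vec<Vec<usize>> {
    edges
        .chunks(EDGES_PER_PACKET)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // 4,000,000 edges (as observed in jython) become ~977 bounded packets
    // instead of a single packet holding everything.
    let edges: Vec<usize> = (0..4_000_000).collect();
    println!("{} packets", split_into_packets(edges).len());
}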

For reasons unknown, when running the jython benchmark on bobcat.moma and vole.moma, build3 triggered Bug 1, but neither build1 nor build2 did. Therefore, build3 did not record any edges from the region modbuf (which is wrong), while both build1 and build2 recorded more than 4,000,000 edges (which is right) in two of the triggered nursery GCs. Due to Bug 2, the load-balance problem caused those two nursery GCs to take much longer than usual. Both build1 and build2 are affected by Bug 2, but Bug 1 caused build3 to skip all those 4,000,000 edges, preventing Bug 2 from manifesting and resulting in a speed-up that shouldn't have happened.

Since we know the anomaly is not caused by this PR, I will merge this PR and leave the two bugs mentioned above to be fixed later.

@wks wks merged commit 43e8a92 into mmtk:master Apr 27, 2023
@wks wks mentioned this pull request May 14, 2023
wenyuzhao added a commit to wenyuzhao/mmtk-core that referenced this pull request Aug 1, 2023
@wks wks mentioned this pull request Oct 25, 2023
github-merge-queue bot pushed a commit that referenced this pull request Oct 25, 2023
We removed the feature for letting the coordinator call both
`stop_all_mutators` and `resume_mutators` since
#794 and
#814. The documentation should
reflect that.

Successfully merging this pull request may close these issues.

The "wrong" worker wakes up when other workers have designated work.