
No coordinator work #794

Merged: 4 commits into mmtk:master, Apr 27, 2023
Conversation

@wks wks commented Apr 18, 2023

Remove the concept of "coordinator work packets". Now ScheduleCollection and StopMutators are both executed on ordinary workers. The only work packet executed by the coordinator is EndOfGC.

Simplified the interaction between the coordinator and the workers. The coordinator only responds to the event that "all workers have parked". Removed the workers-to-coordinator channel. WorkerMonitor now has two Condvars, one (the existing one) for notifying workers about more work available, and another for notifying the coordinator that all workers have parked.

This PR completely removes the concept of "coordinator work". The primary goal is to fix issue #792.

The WorkerGroupState is basically equivalent to the group_sleep Boolean variable.
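
A minimal sketch of that arrangement, using the group_sleep flag and the two Condvars described above. All names and details here are illustrative assumptions (resume_and_wait comes from the diff discussed below), not the exact mmtk-core code:

use std::sync::{Condvar, Mutex};

// State shared between the coordinator and the workers (illustrative).
struct MonitorState {
    parked_workers: usize, // how many workers are currently parked
    worker_count: usize,   // total number of workers
    group_sleep: bool,     // true while the worker group is sleeping
}

struct WorkerMonitor {
    state: Mutex<MonitorState>,
    work_available: Condvar, // existing Condvar: wakes workers when more work is available
    all_parked: Condvar,     // new Condvar: wakes the coordinator when all workers have parked
}

impl WorkerMonitor {
    // Called by a worker that has run out of work packets.
    fn park_until_resumed(&self) {
        let mut s = self.state.lock().unwrap();
        s.parked_workers += 1;
        if s.parked_workers == s.worker_count {
            // Last worker to park: put the group to sleep and notify the coordinator.
            s.group_sleep = true;
            self.all_parked.notify_one();
        }
        while s.group_sleep {
            s = self.work_available.wait(s).unwrap();
        }
        s.parked_workers -= 1;
    }

    // Called by the coordinator: wake the workers, then block until all of
    // them have parked again. This is the only event the coordinator reacts to.
    fn resume_and_wait(&self) {
        let mut s = self.state.lock().unwrap();
        s.group_sleep = false;
        self.work_available.notify_all();
        while !s.group_sleep {
            s = self.all_parked.wait(s).unwrap();
        }
    }
}

Under a scheme like this, the workers-to-coordinator channel is unnecessary: the all_parked Condvar is the only signal the coordinator ever waits on.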

Known issues

I should probably fix #793 in this PR too, given that I am making changes to the synchronisation mechanism, but that is a bit complicated, so I am not fixing it in this PR.

const COORDINATOR_ONLY_STW: bool in Collection is no longer useful. I'll do a multi-repo refactoring to remove it from mmtk-core, mmtk-openjdk and mmtk-ruby.

@wks wks marked this pull request as ready for review April 19, 2023 03:59
@wks wks requested review from wenyuzhao and qinsoon April 19, 2023 03:59

@qinsoon qinsoon left a comment


Just some minor issues. It simplifies the code a lot, which is great. We may need to be careful about any synchronisation cost this change may bring, and we need to run binding tests to make sure this works.

- let event = self.receiver.poll_event();
- let finished = self.process_event(event);
+ self.scheduler.worker_monitor.resume_and_wait();
+ let finished = self.on_all_workers_parked();
Member:

You can just remove on_all_workers_parked() -- it is simple enough to be included in the main loop.

Collaborator (author):

Good idea.

src/scheduler/controller.rs (resolved)
src/scheduler/worker.rs (resolved)
/// The state of the worker group. The worker group alternates between the `Sleeping` and the
/// `Working` states. Workers execute work packets in the `Working` state, but once workers have
/// entered the `Sleeping` state, they cannot continue until the coordinator explicitly transitions
/// the state back to `Working` after it has found more work for the workers to do.
Member:

Better to make it clear that as long as at least one worker is executing, the group state is Working, and when no worker is executing, the group state is Sleeping.

Collaborator (author):

The cause and effect are the other way around: the state dictates the behaviour rather than the behaviour determining the state. The Working state allows workers to execute and forbids the coordinator from opening new buckets, while the Sleeping state forbids workers from executing packets and allows the coordinator to open buckets.
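
For illustration, a doc comment stating the invariant in that direction might read as follows (a hedged sketch; the wording is mine, not the text that was merged):

/// The state of the worker group. The state dictates behaviour rather than
/// recording it:
/// - In the `Working` state, workers may execute work packets, and the
///   coordinator must not open new buckets.
/// - In the `Sleeping` state, workers must not execute work packets, and the
///   coordinator may open new buckets.
enum WorkerGroupState {
    Working,
    Sleeping,
}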


@wenyuzhao wenyuzhao left a comment


LGTM

@qinsoon qinsoon added the PR-testing Run binding tests for the pull request (deprecated: use PR-extended-testing instead) label Apr 20, 2023

@qinsoon qinsoon left a comment


LGTM


wks commented Apr 23, 2023

build1 is master and build2 is this PR. I ran DaCapo Chopin with Immix on bobcat (Core i9-12900KF Alder Lake) and Immix and GenImmix on vole (Ryzen 9 7950X Zen 4). I didn't enable profile-guided optimisation.

The results show a consistent improvement in STW time on both machines and with both Immix and GenImmix, with GenImmix improving to an even greater extent. This is not surprising: removing the so-called "coordinator work" and one communication channel should reduce synchronisation overhead.

Plotty links:

STW time (Immix, bobcat)
Total time (Immix, bobcat)
STW time (Immix, vole)
Total time (Immix, vole)
STW time (GenImmix, vole)
Total time (GenImmix, vole)


qinsoon commented Apr 23, 2023

Wow, some benchmarks showed 40% improvement in GC time by removing a mechanism that we barely used.

@k-sareen

Very cool! I think this is still with the improper min nursery size, right? We should soon investigate what a proper min nursery size would be.


wks commented Apr 24, 2023

This is still based on master, so I haven't applied any changes to the nursery size yet.

I am not sure why the improvement is so significant.

  • Maybe it just removes the overheads I introduced in the last PR (Let the coordinator thread open buckets #782).
  • It may be that it eliminated some unnecessary synchronisation, which affects short GCs (nursery GCs) more than long GCs (the full GCs that non-generational Immix always performs).
  • It may be a result of not using PGO. I am not sure whether we should use PGO for this kind of comparison, because if the two profiles are different, the difference in performance results may not be the result of the changes I made.

Anyway, it shows this change does not have a negative impact on performance, but we should investigate why this is happening using more advanced tools.


wks commented Apr 24, 2023

> Wow, some benchmarks showed 40% improvement in GC time by removing a mechanism that we barely used.

It's only 20% for GenImmix, and 10% for Immix. The zero point of the plots generated by Plotty is misleading.


wks commented Apr 24, 2023

I re-ran lusearch with three builds.

build  | with PR 782 | with this PR
build1 | yes         | no
build2 | yes         | yes
build3 | no          | no

I ran it on both bobcat.moma and vole.moma. The times are normalised to build3. Here are the results:

STW times on bobcat
STW times on vole
Total time on bobcat
Total time on vole

The results are consistent between bobcat and vole. My last PR (#782) introduced a performance regression, but this PR improves performance instead, and it is an improvement even over the original state (build3).


wks commented Apr 25, 2023

Three-way comparison results:

Some benchmarks haven't finished on bobcat yet; I'll update this later.

Note: build3 is the baseline. All numbers are normalised to build3.

build  | with PR 782 | with this PR
build1 | yes         | no
build2 | yes         | yes
build3 | no          | no

STW time (Immix, bobcat)
STW time (GenImmix, bobcat)
STW time (Immix, vole)
STW time (GenImmix, vole)

build2 has a performance improvement over build3 for almost all benchmarks, but jython shows a serious problem. The STW time of GenImmix for jython is 3x greater than before, with small variation, and it can be reproduced on both machines. I think something is horribly wrong.

@k-sareen

Why does the number of GCs change for avrora with GenImmix? This PR shouldn't change the GC triggering, right?


wks commented Apr 26, 2023

I re-ran jython on vole.moma, this time using a 117M heap instead of 1428M. The number of GCs is much greater this time.

Number of GCs (Immix, not normalised)
Number of GCs (GenImmix, not normalised)
Number of major GCs (GenImmix, not normalised)
STW time (Immix, not normalised)
STW time (GenImmix, not normalised)
Total time (Immix, not normalised)
Total time (GenImmix, not normalised)

In this run, GenImmix triggered about 272 GCs, among which there was at most 1 major GC, and the GC time ranges from 480ms (build2) to 640ms (build1), with build3 in the middle (560ms). 480ms is still about a 14% reduction from 560ms.

I don't know how many of the 272 GCs are "kind 1" or "kind 2" GCs. Assume a "kind 1" GC takes only 1ms and a "kind 2" GC takes 15ms. Then there need to be about 20 "kind 2" GCs among them for the STW time to reach 560ms: (272 - 20) * 1ms + 20 * 15ms = 552ms, which is approximately 560ms.

So from this result, I think we didn't really make jython 3x slower. It was the way I ran the experiment that gave us misleading results.


wks commented Apr 26, 2023

> Why does the number of GCs change for avrora with GenImmix? This PR shouldn't change the GC triggering, right?

I think it is just non-determinism. As we discussed in today's meeting, if one GC is triggered at a very unfortunate moment, it will impact the set of live objects, and transitively the time at which the next GC is triggered. This influence can go on and on, eventually resulting in one run triggering more GCs than another.

@k-sareen

> Why does the number of GCs change for avrora with GenImmix? This PR shouldn't change the GC triggering, right?

> I think it is just non-determinism. As we discussed in today's meeting, if one GC is triggered at a very unfortunate moment, it will impact the set of live objects, and transitively the time at which the next GC is triggered. This influence can go on and on, eventually resulting in one run triggering more GCs than another.

Could you try avrora on a tighter heap (or all benchmarks on a tighter heap for that matter) to confirm this?


wks commented Apr 27, 2023

> Why does the number of GCs change for avrora with GenImmix? This PR shouldn't change the GC triggering, right?

> I think it is just non-determinism. As we discussed in today's meeting, if one GC is triggered at a very unfortunate moment, it will impact the set of live objects, and transitively the time at which the next GC is triggered. This influence can go on and on, eventually resulting in one run triggering more GCs than another.

> Could you try avrora on a tighter heap (or all benchmarks on a tighter heap for that matter) to confirm this?

I re-ran avrora and a few other benchmarks that had low GC counts, with a smaller heap. avrora still triggered fewer than 100 GCs, but the result is slightly better. The GC counts of the other benchmarks (except avrora and batik) increased significantly. The results are consistent: we still see that build2 has a non-trivial speed-up in STW time for GenImmix over build3.

Note: jython failed with build3 with GenImmix because of SIGSEGV. I'll re-run jython with wenyuzhao/mmtk-openjdk@1c384fd cherry-picked.

Plotty link:

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2023-04-26-Wed-092433&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.stw&|10&iteration^1^4|20&1^invocation|40&Histogram%20(with%20CI)^build^benchmark&

Here is the STW time for avrora (not normalised).


wks commented Apr 27, 2023

Here is the result for jython after commit wenyuzhao/mmtk-openjdk@1c384fd is cherry-picked into the mmtk-openjdk repository of all builds.

GC is triggered about 600 times for GenImmix and about 320 times for Immix.

STW time (not normalised)
Total time (not normalised)

Plotty link:
https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2023-04-27-Thu-055518&benchmark^build^invocation^iteration^mmtk_gc&GC^majorGC^time^time.stw&|10&iteration^1^4|20&1^invocation|40&Histogram%20(with%20CI)^build^mmtk_gc&

After this commit is applied, build3 no longer crashes, and build2 still improves on build3's STW time by a significant amount.


wks commented Apr 27, 2023

The performance anomaly with the jython benchmark is the result of two bugs:

  1. There is a bug in the mmtk-openjdk binding: with GenImmix, arraycopy sometimes delivers zero-sized slices to the region modbuf in the write barrier where the slices should have a positive size. This bug is fixed in wenyuzhao/mmtk-openjdk@1c384fd.
  2. There is a load-balance problem in the implementation of the ProcessRegionModBuf work packet: it always generates one single ProcessEdgesWork no matter how many slices there are in the region modbuf. In the jython benchmark, the generated ProcessEdgesWork can contain more than 4,000,000 edges. This bug is reported in Load-balance problem in ProcessRegionModBuf #797; a sketch of the kind of fix it calls for follows below.
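
For illustration, a fix along the lines #797 asks for would split the buffered edges into capacity-bounded chunks instead of one huge packet. Everything below (the function, the capacity constant, using usize as a stand-in for an edge) is an assumption made for this sketch, not mmtk-core's actual code:

// Assumed per-packet capacity; the real bound would be whatever capacity
// ProcessEdgesWork packets normally have.
const EDGES_PER_PACKET: usize = 4096;

// Split the edges collected from the region modbuf into one chunk per
// EDGES_PER_PACKET edges, so each chunk can become its own ProcessEdgesWork
// and the tracing load can be spread across workers.
fn split_into_packets(edges: Vec<usize>) -> Vec<Vec<usize>> {
    edges
        .chunks(EDGES_PER_PACKET)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // 4,000,000 edges (as observed in jython) become ~977 bounded packets
    // instead of a single packet holding everything.
    let edges: Vec<usize> = (0..4_000_000).collect();
    println!("{} packets", split_into_packets(edges).len());
}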

For reasons unknown, when running the jython benchmark on bobcat.moma and vole.moma, build3 triggered Bug 1, but neither build1 nor build2 did. Therefore, build3 did not record any edges from the region modbuf (which is wrong), while both build1 and build2 recorded more than 4,000,000 edges (which is right) in two of the triggered nursery GCs. Due to Bug 2, the load-balance problem caused those two nursery GCs to take much longer than usual. Both build1 and build2 are affected by Bug 2, but Bug 1 caused build3 to skip all those 4,000,000 edges, preventing Bug 2 from manifesting and resulting in a speed-up that shouldn't have happened.

Since we know the anomaly is not caused by this PR, I will merge this PR and leave the two bugs mentioned above to be fixed later.

@wks wks merged commit 43e8a92 into mmtk:master Apr 27, 2023
@wks wks mentioned this pull request May 14, 2023
wenyuzhao added a commit to wenyuzhao/mmtk-core that referenced this pull request Aug 1, 2023
@wks wks mentioned this pull request Oct 25, 2023
github-merge-queue bot pushed a commit that referenced this pull request Oct 25, 2023
We removed the feature for letting the coordinator call both
`stop_all_mutators` and `resume_mutators` since
#794 and
#814. The documentation should
reflect that.

Successfully merging this pull request may close these issues.

The "wrong" worker wakes up when other workers have designated work.