
Send and receive work requests via proxy and multiplexer #6857

Closed
borkaehw wants to merge 34 commits

Conversation

borkaehw
Contributor

@borkaehw borkaehw commented Dec 6, 2018

This is an attempt to solve issue #2832.
The design doc, Multiplex persistent worker, has been approved.

Two minor design changes from the design doc:

  • The number of WorkerProxy instances is still limited by --worker_max_instances.
  • We merged the worker multiplexer sender and receiver into one WorkerMultiplexer; WorkerProxy sends requests to the worker process directly (see the sketch below).
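
A minimal sketch of the arrangement, with illustrative names and hand-stubbed protocol types standing in for the real WorkRequest/WorkResponse protos (this is not the actual Bazel code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

class WorkRequest { int requestId; }
class WorkResponse { int requestId; }

// One WorkerMultiplexer per worker key: it owns the single worker process
// and demultiplexes responses by request id.
class WorkerMultiplexer {
  private final Map<Integer, WorkResponse> responses = new ConcurrentHashMap<>();
  private final Map<Integer, CountDownLatch> pending = new ConcurrentHashMap<>();

  // Called directly by any WorkerProxy: register the request, then write it
  // to the worker process's stdin.
  synchronized void putRequest(WorkRequest request) {
    pending.put(request.requestId, new CountDownLatch(1));
    // ... serialize the request to the worker's stdin ...
  }

  // The single reader thread calls this after parsing a response off stdout.
  void onResponse(WorkResponse response) {
    responses.put(response.requestId, response);
    pending.get(response.requestId).countDown();
  }

  // A proxy blocks here until its own response has arrived.
  WorkResponse getResponse(int requestId) throws InterruptedException {
    pending.get(requestId).await();
    pending.remove(requestId);
    return responses.remove(requestId);
  }
}

// Many WorkerProxy instances (still bounded by --worker_max_instances)
// share one WorkerMultiplexer and hence one worker process.
class WorkerProxy {
  private final WorkerMultiplexer multiplexer;

  WorkerProxy(WorkerMultiplexer multiplexer) { this.multiplexer = multiplexer; }

  WorkResponse execute(WorkRequest request) throws InterruptedException {
    multiplexer.putRequest(request);  // request goes to the worker directly
    return multiplexer.getResponse(request.requestId);
  }
}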

TODOs

  • Add tests.
  • Fix the issue where the multiplexer silences unparseable responses from the worker. Changes are required in WorkerMultiplexer.java's run().
  • WorkerMultiplexer.waitResponse sometimes throws NullPointerExceptions.

@jjudd
Contributor

jjudd commented Dec 19, 2018

Friendly ping here.

@jjudd
Contributor

jjudd commented Jan 8, 2019

Friendly ping on this @philwo

@jin jin added team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed team-Execution labels Jan 14, 2019
@jin
Member

jin commented Jan 14, 2019

Ping @philwo

@johnynek
Member

Is the review SLA something like 2 months? It is really disheartening to watch these PRs go by with the only traffic being "ping, ping, ping..."

@philwo
Member

philwo commented Jan 15, 2019

Sorry for the delay.

I was on parental leave until January 2nd, came back to the office, and found myself firefighting a broken Buildkite CI; that took until today.

Dear people - if you don’t get a response from someone, maybe ping someone else? It’s not like we don’t want to review PRs, but sometimes the assignee just can’t.

@philwo philwo requested review from jmmv and rupertks and removed request for philwo January 15, 2019 19:09
@philwo philwo assigned jmmv and rupertks and unassigned philwo Jan 15, 2019
@philwo philwo added team-Local-Exec Issues and PRs for the Execution (Local) team and removed team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Jan 15, 2019
@jin
Member

jin commented Jan 15, 2019

@philwo I was under the impression that you are the reviewer for this particular design and implementation, and it was not clear if you wanted to hand this review over to someone else.

This brings up a discussion point about setting expectations w.r.t. OOO. Sometimes people are OOO / on leave, but GitHub didn't have a good way to publicly announce that until recently.

@philwo
Member

philwo commented Jan 15, 2019

@jin Yes, the problem wasn't that I didn't want to review this; it was that I couldn't, and I wasn't even aware of the pings until a few minutes ago due to some wrong settings on my side. I'm really sorry for that.

In this particular example, a couple of things went wrong, but luckily it's easy to do better:

  • I was out of office until recently and we had no fallback for this case. It's always possible that a single person is on a long vacation, on leave, or just overloaded with work. AI: Maybe we should always require a second reviewer, or we should have a way for people to escalate to someone on the team who knows the bigger picture of everyone's availability and can make a call in cases like this. I'll also take care to set my GitHub out-of-office status when going on vacation, and I'll teach others on the team about that new feature.
  • The "code owners" feature assigned this to me and only me. AI: We should ensure that there are at least two people for every component in our owners configuration, or create GitHub teams that actually match our team structure and figure out if we can get the code owners feature to assign work to teams instead of individual people.
  • My GitHub notifications got lost in the pile of work e-mails I get every day, so I didn't even see the pings until now. Sorry! AI: I fixed that by setting up better notification settings, Gmail filters, and labels, so this shouldn't happen again. In general, if I don't seem to respond on one channel, feel free to ping me over e-mail or chat; I might just accidentally miss your pings. I'll also take care to go through my notification mails regularly to check that I didn't miss something.

@johnynek
Member

@jin assigned this to @philwo on 12/10. I assumed he had state since you are both at Google. If two Googlers have trouble keeping the state, how can external people know when we should have timed out, and somehow find a different Googler to ping?

Can I suggest that you put someone on triage for PRs, maybe as a rotating position (even weekly), to make sure PRs get replies, are properly assigned, and have an ETA when there will be significant waits? In this case, for example, the reviewer was on paternity leave, so we could have been told there was an OOO situation that would delay this for 1-2 months.

@jin
Member

jin commented Jan 15, 2019

@philwo those are terrific suggestions, and we should definitely implement them. I'll bring this up in the next DevEx gardener meeting. I also want to apologize for not reaching out to you on other channels, or asking around to see if there's anyone else available to review this.

We can also augment the design review process to always require two reviewers (one primary, one secondary), instead of one lead reviewer. cc @laurentlb

> Can I suggest that you put someone on triage for PRs, maybe as a rotating position (even weekly), to make sure PRs get replies, are properly assigned, and have an ETA when there will be significant waits? In this case, for example, the reviewer was on paternity leave, so we could have been told there was an OOO situation that would delay this for 1-2 months.

@johnynek actually, this is happening right now, but we're still figuring out a good set of SLOs for issue and PR responses. We shouldn't bottleneck reviews on a single person, as @philwo suggested.

@philwo
Member

philwo commented Jan 15, 2019

@jin Thanks so much! Let's discuss some ideas and action items in the meeting. I'm happy to volunteer to write up a doc with the stuff we come up with; I guess that's the least I can do after my embarrassing mess-up with the Gmail filters hiding all these notification e-mails 😔

Then we can distribute the knowledge about working efficiently with GitHub in each of our offices and follow up on the AIs, so that this doesn't happen again... WDYT?

@jin
Member

jin commented Jan 15, 2019

@philwo that works, let's take this offline from this thread :-) thanks!

@laurentlb
Contributor

laurentlb commented Jan 15, 2019

We updated the Contributing page a month ago; it now says:

  1. Wait for a Bazel team member to assign you a reviewer. It should be done in 2 business days (excluding holidays in the USA and Germany). If you do not get a reviewer within that time frame, you can ask for one by sending a mail to [email protected].
  2. Work with the reviewer to complete a code review. For each change, create a new commit and push it to make changes to your pull request. If the review takes too long (e.g. the reviewer is unresponsive), please also send an email to [email protected].

As mentioned above, we should improve the process so that this doesn't happen again. But if you have any other bad experience, it's useful to report it. Even a week of waiting with no news is way too long.

@SrodriguezO
Contributor

The force push was just a rebase on master.

@philwo
Member

philwo commented Sep 9, 2019

Thank you! I don't know about @jmmv's availability, but I'll probably have time to take a look at this on Monday next week.

@borkaehw
Contributor Author

borkaehw commented Oct 2, 2019

The NullPointerException issue has been fixed.

@philwo
Member

philwo commented Oct 2, 2019

Thank you, @borkaehw! I reviewed the code today and will try to import it when I’m back in the office.

@philwo
Member

philwo commented Oct 14, 2019

There's one test failure that I get repeatedly on presubmits. I'm going to disable that test:

** test_build_fails_when_worker_returns_junk ***********************************
$TEST_TMPDIR defined: output root default is '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151' and max_idle_secs default is '15'.
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Writing tracer profile to '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151/root/2472b9ba977ce049a016a5b3467b355c/command.profile.gz'
-- Test log: -----------------------------------------------------------
$TEST_TMPDIR defined: output root default is '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151' and max_idle_secs default is '15'.
INFO: Writing tracer profile to '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151/root/2472b9ba977ce049a016a5b3467b355c/command.profile.gz'
Loading: 
Loading: 0 packages loaded
Analyzing: target //testvgyDnT:hello_world_2 (0 packages loaded, 0 targets configured)
INFO: Analyzed target //testvgyDnT:hello_world_2 (0 packages loaded, 1 target configured).
INFO: Found 1 target...
INFO: SpawnActionContextMap: "" = [WorkerSpawnStrategy, StandaloneSpawnStrategy]
INFO: ContextMap: Context = BazelWorkspaceStatusActionContext
INFO: ContextMap: CppIncludeExtractionContext = DummyCppIncludeExtractionContext
INFO: ContextMap: CppIncludeScanningContext = DummyCppIncludeScanningContext
INFO: ContextMap: FileWriteActionContext = FileWriteStrategy
INFO: ContextMap: SpawnActionContext = ProxySpawnActionContext
INFO: ContextMap: SpawnCache = NoSpawnCache
INFO: ContextMap: SymlinkTreeActionContext = SymlinkTreeStrategy
INFO: ContextMap: TemplateExpansionContext = LocalTemplateExpansionStrategy
INFO: ContextMap: TestActionContext = ExclusiveTestStrategy
[0 / 3] [Prepa] BazelWorkspaceStatusAction stable-status.txt ... (2 actions, 1 running)
SUBCOMMAND: # //testvgyDnT:hello_world_2 [action 'Working on hello_world_2', configuration: 9a088452797200c5bf92e0469646fb64]
(cd /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151/root/2472b9ba977ce049a016a5b3467b355c/execroot/main && \
  exec env - \
  bazel-out/host/bin/testvgyDnT/worker '--poison_after=1' @bazel-out/k8-fastbuild/bin/testvgyDnT/hello_world_2_worker_input)
ERROR: A crash occurred while bazel was trying to handle a crash! Please file a bug against bazel and include the information below.
Original uncaught exception:
java.lang.NullPointerException
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.waitResponse(WorkerMultiplexer.java:198)
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.run(WorkerMultiplexer.java:207)
Exception encountered during BugReport#handleCrash:
java.lang.NullPointerException
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.waitResponse(WorkerMultiplexer.java:198)
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.run(WorkerMultiplexer.java:207)

Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151/root/2472b9ba977ce049a016a5b3467b355c/server/jvm.out')
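
A hypothetical illustration of this failure mode (not the actual WorkerMultiplexer code): if the reader thread looks up a per-request semaphore that was never registered, or was already removed, the map lookup returns null and the following call throws. The field name responseChecker below is an assumption for illustration only.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class WaitResponseSketch {
  // One semaphore per in-flight request id (hypothetical layout).
  private final Map<Integer, Semaphore> responseChecker = new ConcurrentHashMap<>();

  void signalResponse(int requestId) {
    // Throws NullPointerException if requestId has no entry in the map.
    responseChecker.get(requestId).release();
  }

  // Defensive variant: tolerate a missing entry instead of crashing.
  void signalResponseSafe(int requestId) {
    Semaphore waiter = responseChecker.get(requestId);
    if (waiter != null) {
      waiter.release();
    }
  }
}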

@bazel-io bazel-io closed this in 6d1b972 Oct 14, 2019
@borkaehw
Contributor Author

@philwo Thank you so much. I am happy to address related issues in the future.

I believe it's still an optional feature. Should I draft a doc on how to turn it on?

@philwo
Member

philwo commented Oct 15, 2019

@borkaehw Documentation is always welcome, so yes, thank you :)

@borkaehw
Contributor Author

I am not sure about the steps here. Where and how can I contribute to the documentation?

@philwo
Member

philwo commented Oct 21, 2019

@borkaehw I think you can create a new Markdown file, e.g. persistent-workers.md, here:
https://github.com/bazelbuild/bazel/blob/master/site/docs/

We can then find out how to wire it up so that it shows in the menu - I'm also not 100% sure how this currently works. ;)

Maybe a good example would be this one:
https://github.com/bazelbuild/bazel/blob/master/site/docs/build-event-protocol.md

Here's how it looks on the website:
https://docs.bazel.build/versions/master/build-event-protocol.html

@philwo
Member

philwo commented Oct 21, 2019

It seems like the bazel_worker_multiplexer_test is still flaky, at least this one times out regularly in our internal Google CI: "test_multiple_target_with_delay".

Example log:

[14 / 15] Working on hello_world_3; 4s worker
[14 / 15] Working on hello_world_3; 15s worker
[14 / 15] Working on hello_world_3; 35s worker
[14 / 15] Working on hello_world_3; 50s worker
[14 / 15] Working on hello_world_3; 83s worker
[14 / 15] Working on hello_world_3; 105s worker
[14 / 15] Working on hello_world_3; 130s worker
[14 / 15] Working on hello_world_3; 158s worker
[14 / 15] Working on hello_world_3; 191s worker
[14 / 15] Working on hello_world_3; 229s worker
[14 / 15] Working on hello_world_3; 273s worker
[14 / 15] Working on hello_world_3; 323s worker
[14 / 15] Working on hello_world_3; 380s worker
[14 / 15] Working on hello_world_3; 447s worker
[14 / 15] Working on hello_world_3; 523s worker
[14 / 15] Working on hello_world_3; 610s worker

Blaze caught terminate signal; shutting down.

------------------------------------------------------------------------
test_multiple_target_with_delay FAILED: terminated by signal TERM.
third_party/bazel/src/test/shell/integration/bazel_worker_multiplexer_test:549: in call to main

@borkaehw
Contributor Author

@philwo I see. It seems like the response message got lost, so the worker was waiting for a response indefinitely. Do you have any idea how this could happen? Do we need to disable more tests?
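
One way to turn such a silent hang into a diagnosable failure (a sketch, under the assumption that the proxy waits on a per-request semaphore; not the actual Bazel code) is to wait with a timeout instead of indefinitely:

import java.io.IOException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class ResponseWaitSketch {
  // Returns normally once the response signal arrives, but gives up after
  // the deadline instead of blocking forever if the message was lost.
  static void awaitResponse(Semaphore responseArrived, long timeoutSeconds)
      throws InterruptedException, IOException {
    if (!responseArrived.tryAcquire(timeoutSeconds, TimeUnit.SECONDS)) {
      throw new IOException("no response from worker within " + timeoutSeconds + "s");
    }
  }
}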

@borkaehw
Contributor Author

@philwo The documentation PR has been opened: #10108. Thanks.

bazel-io pushed a commit that referenced this pull request Nov 12, 2019
Documentation for Multiplex Workers. See the original PR for details: #6857

Fix: #2832

Closes #10108.

PiperOrigin-RevId: 279950386
@borkaehw borkaehw deleted the multiplexer branch December 20, 2019 18:04
@larsrc-google
Contributor

Hi borkaehw, I recently joined the Bazel team and I'm picking up work on workers. I did a simple multi-threaded version of our internal JavaBuilder, and was surprised to find it performed worse in my benchmarks than having multiple separate workers. Did you get the hoped-for speed/RAM usage improvement out of your multiplex workers?

@borkaehw
Contributor Author

We expect it to use less RAM, since it should spawn fewer JVMs, while running at almost the same speed. We implemented it to address Bazel's massive memory usage. How much worse did it perform?

@larsrc-google
Contributor

larsrc-google commented Jul 28, 2020 via email

@SrodriguezO
Contributor

Hey @larsrc-google,

This sounds concerning, but we are unable to reproduce the behavior you're observing. We tried building our Scala services with and without multiplex workers, and we noticed a ~10% performance improvement when we use multiplex workers (without caching, across 3 trials).

We used the following higherkindness/rules_scala commits for these benchmarks:

  • multiplex workers enabled: 57949194155958241ffd3aaff0f275b3fee5ee62
  • multiplex workers disabled: 8674ae18f3e0be846a5e68d34d152da412c27529

The only difference is the multiplex worker revert in 8674ae18f3e0be846a5e68d34d152da412c27529.

We ran these against our own codebase, so we cannot share the source we built. Do you have a sample and/or OSS repo you're using for your investigation that we could check out?

@larsrc-google
Contributor

I believe my commit 7be7aed fixes the problem I was seeing: having a request id of 0 made that request blocking, so no more multiplex requests could be started until that one finished. Depending on the distribution of requests, that could make things a lot slower.
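
A hypothetical sketch of the dispatch behavior described above (illustrative only, not the actual change in 7be7aed):

class DispatchSketch {
  private final Object lock = new Object();

  void dispatch(int requestId) {
    if (requestId == 0) {
      // Singleplex path: hold the lock for the whole round trip, so no
      // other request can be dispatched until this one finishes.
      synchronized (lock) {
        sendAndAwaitResponse(requestId);
      }
    } else {
      // Multiplex path: send now; the response is matched by id later.
      sendAsync(requestId);
    }
  }

  private void sendAndAwaitResponse(int requestId) { /* blocking round trip */ }

  private void sendAsync(int requestId) { /* non-blocking send */ }
}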

@SrodriguezO
Contributor

Interesting. We don't have that commit on the version of Bazel we're using (3.3), but we still couldn't reproduce the issue. I tried patching Bazel 3.3 with your change to see if we noticed any performance improvements, but performance is the same as before with multiplex workers enabled 🤔

@larsrc-google
Contributor

It's possible that you happened to have other workers run before the multiplex workers, which would make the request id never be 0 for a multiplex worker. The request id counter is shared between all kinds of workers, so that's quite possible.
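
A minimal sketch of that sharing (the counter below is an assumed shape, not the actual Bazel field):

import java.util.concurrent.atomic.AtomicInteger;

class RequestIdSketch {
  // One counter shared by singleplex and multiplex workers alike: whichever
  // worker sends the very first request of the build gets id 0. If ordinary
  // workers run first, a multiplex worker never sees request id 0.
  private static final AtomicInteger REQUEST_ID = new AtomicInteger(0);

  static int nextRequestId() {
    return REQUEST_ID.getAndIncrement();
  }
}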

@larsrc-google
Contributor

BTW, it looks like the flakes mentioned by @philwo aren't happening any more.

@larsrc-google
Contributor

@borkaehw: I've been poring over the code a lot for refactoring and integrating with dynamic execution, and I have two questions of a Chesterton's Fence nature:

  1. Was there a specific reason you made WorkerMultiplexer.getResponse() return an InputStream instead of the already-parsed WorkResponse, or was it just an implementation accident?

  2. Why did you use regular maps with semaphores around them rather than ConcurrentHashMap or similar?

@borkaehw
Contributor Author

  1. It might simply be because the regular worker also parses the InputStream in Worker.java (trying to be consistent), and there is an extra null check before parsing in WorkerProxy.java (which feels like a good place to check):
if (inputStream == null) {
    return null;
}
return WorkResponse.parseDelimitedFrom(inputStream);

But I don't have a strong opinion that it should stay this way. Feel free to make any improvements.

  2. I didn't think too much about that; ConcurrentHashMap looks like a better alternative. Feel free to change it.
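
For comparison, a sketch of the two approaches (illustrative only, not the actual PR code):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class MapGuardSketch {
  // Approach in this PR: a plain map guarded by a binary semaphore.
  private final Map<Integer, String> guarded = new HashMap<>();
  private final Semaphore mapLock = new Semaphore(1);

  void putGuarded(int key, String value) throws InterruptedException {
    mapLock.acquire();
    try {
      guarded.put(key, value);
    } finally {
      mapLock.release();
    }
  }

  // Alternative: ConcurrentHashMap does the synchronization internally,
  // so callers need no explicit lock for single-key operations.
  private final Map<Integer, String> concurrent = new ConcurrentHashMap<>();

  void putConcurrent(int key, String value) {
    concurrent.put(key, value);
  }
}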

Thanks for reaching out!

@larsrc-google
Contributor

Thank you. As you can see in the current WorkerProxy, changing away from passing an InputStream simplified the code a lot. I'll have to consider more carefully the impact of using ConcurrentHashMap, especially together with dynamic execution.
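
A sketch of that simplification, with illustrative names (WorkResponse is the worker protocol proto and parseDelimitedFrom its standard protobuf parse call; getResponseStream is a hypothetical helper):

import com.google.devtools.build.lib.worker.WorkerProtocol.WorkResponse;
import java.io.IOException;
import java.io.InputStream;

class GetResponseSketch {
  // Before: hand each caller a raw stream to null-check and parse itself.
  InputStream getResponseStream(int requestId) {
    /* ... look up the recorded response stream for this request id ... */
    return null;
  }

  // After: parse once inside the multiplexer and return the proto, so
  // WorkerProxy never touches the stream.
  WorkResponse getResponse(int requestId) throws IOException {
    InputStream inputStream = getResponseStream(requestId);
    if (inputStream == null) {
      return null;
    }
    return WorkResponse.parseDelimitedFrom(inputStream);
  }
}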
