
Send and receive work requests via proxy and multiplexer #6857

Closed
borkaehw wants to merge 34 commits

Conversation

borkaehw
Contributor

@borkaehw borkaehw commented Dec 6, 2018

This is an attempt to solve issue #2832.
The design doc, Multiplex persistent worker, has been approved.

Two minor design changes from the design doc:

  • The number of WorkerProxy instances is still limited by --worker_max_instances.
  • We merged the worker multiplexer sender and receiver into one WorkerMultiplexer; WorkerProxy sends requests to the worker process directly (see the sketch below).
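
A minimal sketch of the arrangement, with illustrative names and hand-stubbed protocol types standing in for the real WorkRequest/WorkResponse protos (this is not the actual Bazel code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

class WorkRequest { int requestId; }
class WorkResponse { int requestId; }

// One WorkerMultiplexer per worker key: it owns the single worker process
// and demultiplexes responses by request id.
class WorkerMultiplexer {
  private final Map<Integer, WorkResponse> responses = new ConcurrentHashMap<>();
  private final Map<Integer, CountDownLatch> pending = new ConcurrentHashMap<>();

  // Called directly by any WorkerProxy: register the request, then write it
  // to the worker process's stdin.
  synchronized void putRequest(WorkRequest request) {
    pending.put(request.requestId, new CountDownLatch(1));
    // ... serialize the request to the worker's stdin ...
  }

  // The single reader thread calls this after parsing a response off stdout.
  void onResponse(WorkResponse response) {
    responses.put(response.requestId, response);
    pending.get(response.requestId).countDown();
  }

  // A proxy blocks here until its own response has arrived.
  WorkResponse getResponse(int requestId) throws InterruptedException {
    pending.get(requestId).await();
    pending.remove(requestId);
    return responses.remove(requestId);
  }
}

// Many WorkerProxy instances (still bounded by --worker_max_instances)
// share one WorkerMultiplexer and hence one worker process.
class WorkerProxy {
  private final WorkerMultiplexer multiplexer;

  WorkerProxy(WorkerMultiplexer multiplexer) { this.multiplexer = multiplexer; }

  WorkResponse execute(WorkRequest request) throws InterruptedException {
    multiplexer.putRequest(request);  // request goes to the worker directly
    return multiplexer.getResponse(request.requestId);
  }
}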

TODOs

  • Add tests.
  • Fix the issue where the multiplexer silences unparseable responses from the worker. Changes are required in WorkerMultiplexer.java's run().
  • WorkerMultiplexer.waitResponse sometimes throws NullPointerExceptions.

@jjudd
Contributor

jjudd commented Dec 19, 2018

Friendly ping here.

@jjudd
Contributor

jjudd commented Jan 8, 2019

Friendly ping on this @philwo

@jin jin added team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed team-Execution labels Jan 14, 2019
@jin
Member

jin commented Jan 14, 2019

Ping @philwo

@johnynek
Member

Is the review SLA something like 2 months? It is really disheartening to watch these PRs go by with the only traffic being "ping, ping, ping..."

@philwo
Member

philwo commented Jan 15, 2019

Sorry for the delay.

I was on parental leave until January 2nd, came back to the office, and found myself firefighting a broken Buildkite CI; that took until today.

Dear people - if you don’t get a response from someone, maybe ping someone else? It’s not like we don’t want to review PRs, but sometimes the assignee just can’t.

@philwo philwo requested review from jmmv and rupertks and removed request for philwo January 15, 2019 19:09
@philwo philwo assigned jmmv and rupertks and unassigned philwo Jan 15, 2019
@philwo philwo added team-Local-Exec Issues and PRs for the Execution (Local) team and removed team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Jan 15, 2019
@jin
Member

jin commented Jan 15, 2019

@philwo I was under the impression that you are the reviewer for this particular design and implementation, and it was not clear if you wanted to hand this review over to someone else.

This brings up a discussion point about setting expectations w.r.t. OOO. Sometimes people are OOO / on leave, but GitHub didn't have a good way to publicly announce that until recently.

@philwo
Member

philwo commented Jan 15, 2019

@jin Yes, the problem wasn't that I didn't want to review this; it was that I couldn't, and I wasn't even aware of the pings until a few minutes ago due to some wrong settings on my side. I'm really sorry for that.

In this particular example, a couple of things went wrong, but luckily it's easy to do better:

  • I was out of office until recently and we had no fallback for this case. It's always possible that a single person is on a long vacation, on leave, or just overloaded with work. AI: Maybe we should always require a second reviewer, or we should have a way for people to escalate to someone on the team who knows the bigger picture of everyone's availability and can make a call in cases like this. I'll also take care to set my GitHub out-of-office status when going on vacation, and I'll teach others on the team about that new feature.
  • The "code owners" feature assigned this to me and only me. AI: We should ensure that there are at least two people for every component in our owners configuration, or create GitHub teams that actually match our team structure and figure out if we can get the code owners feature to assign work to teams instead of individual people.
  • My GitHub notifications got lost in the pile of work e-mails I get every day, so I didn't even see the pings until now. Sorry! AI: I fixed that by setting up better notification settings, Gmail filters, and labels, so this shouldn't happen again. In general, if I don't seem to respond on one channel, feel free to ping me over e-mail or chat; I might just accidentally miss your pings. I'll also take care to go through my notification mails regularly to check that I didn't miss something.

@johnynek
Member

@jin assigned this to @philwo on 12/10. I assumed he had state since you are both at Google. If two Googlers have trouble keeping the state, how can external people know when we should have timed out, and somehow find a different Googler to ping?

Can I suggest that you put someone on triage for PRs, maybe as a rotating position (even weekly), to make sure PRs get replies, are properly assigned, and have an ETA when there will be significant waits? In this case, for example, the reviewer was on paternity leave, so we could have been told there was an OOO situation that would delay this for 1-2 months.

@jin
Member

jin commented Jan 15, 2019

@philwo those are terrific suggestions, and we should definitely implement them. I'll bring this up in the next DevEx gardener meeting. I also want to apologize for not reaching out to you on other channels, or asking around to see if there's anyone else available to review this.

We can also augment the design review process to always require two reviewers (one primary, one secondary), instead of one lead reviewer. cc @laurentlb

> Can I suggest that you put someone on triage for PRs, maybe as a rotating position (even weekly), to make sure PRs get replies, are properly assigned, and have an ETA when there will be significant waits? In this case, for example, the reviewer was on paternity leave, so we could have been told there was an OOO situation that would delay this for 1-2 months.

@johnynek actually, this is happening right now, but we're still figuring out a good set of SLOs for issue and PR responses. We shouldn't bottleneck reviews on a single person, as @philwo suggested.

@philwo
Member

philwo commented Jan 15, 2019

@jin Thanks so much! Let's discuss some ideas and action items in the meeting. I'm happy to volunteer to write up a doc with the stuff we come up with; I guess that's the least I can do after my embarrassing mess-up with the Gmail filters hiding all these notification e-mails 😔

Then we can distribute the knowledge about working efficiently with GitHub in each of our offices and follow up on the AIs, so that this doesn't happen again... WDYT?

@jin
Member

jin commented Jan 15, 2019

@philwo that works, let's take this offline from this thread :-) thanks!

@laurentlb
Contributor

laurentlb commented Jan 15, 2019

We updated the Contributing page a month ago; it now says:

  1. Wait for a Bazel team member to assign you a reviewer. It should be done in 2 business days (excluding holidays in the USA and Germany). If you do not get a reviewer within that time frame, you can ask for one by sending a mail to [email protected].
  2. Work with the reviewer to complete a code review. For each change, create a new commit and push it to make changes to your pull request. If the review takes too long (e.g. the reviewer is unresponsive), please also send an email to [email protected].

As mentioned above, we should improve the process so that this doesn't happen again. But if you have any other bad experience, it's useful to report it. Even a week of waiting with no news is way too long.

@SrodriguezO
Contributor

The force push was just a rebase on master.

@philwo
Member

philwo commented Sep 9, 2019

Thank you! I don't know about @jmmv's availability, but I'll probably have time to take a look at this on Monday next week.

@borkaehw
Contributor Author

borkaehw commented Oct 2, 2019

The NullPointerException issue has been fixed.

@philwo
Member

philwo commented Oct 2, 2019

Thank you, @borkaehw! I reviewed the code today and will try to import it when I’m back in the office.

@philwo
Member

philwo commented Oct 14, 2019

There's one test failure that I get repeatedly on presubmits. I'm going to disable that test:

** test_build_fails_when_worker_returns_junk ***********************************
$TEST_TMPDIR defined: output root default is '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151' and max_idle_secs default is '15'.
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Writing tracer profile to '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151/root/2472b9ba977ce049a016a5b3467b355c/command.profile.gz'
-- Test log: -----------------------------------------------------------
$TEST_TMPDIR defined: output root default is '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151' and max_idle_secs default is '15'.
INFO: Writing tracer profile to '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151/root/2472b9ba977ce049a016a5b3467b355c/command.profile.gz'
Loading: 
Loading: 0 packages loaded
Analyzing: target //testvgyDnT:hello_world_2 (0 packages loaded, 0 targets configured)
INFO: Analyzed target //testvgyDnT:hello_world_2 (0 packages loaded, 1 target configured).
INFO: Found 1 target...
INFO: SpawnActionContextMap: "" = [WorkerSpawnStrategy, StandaloneSpawnStrategy]
INFO: ContextMap: Context = BazelWorkspaceStatusActionContext
INFO: ContextMap: CppIncludeExtractionContext = DummyCppIncludeExtractionContext
INFO: ContextMap: CppIncludeScanningContext = DummyCppIncludeScanningContext
INFO: ContextMap: FileWriteActionContext = FileWriteStrategy
INFO: ContextMap: SpawnActionContext = ProxySpawnActionContext
INFO: ContextMap: SpawnCache = NoSpawnCache
INFO: ContextMap: SymlinkTreeActionContext = SymlinkTreeStrategy
INFO: ContextMap: TemplateExpansionContext = LocalTemplateExpansionStrategy
INFO: ContextMap: TestActionContext = ExclusiveTestStrategy
[0 / 3] [Prepa] BazelWorkspaceStatusAction stable-status.txt ... (2 actions, 1 running)
SUBCOMMAND: # //testvgyDnT:hello_world_2 [action 'Working on hello_world_2', configuration: 9a088452797200c5bf92e0469646fb64]
(cd /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151/root/2472b9ba977ce049a016a5b3467b355c/execroot/main && \
  exec env - \
  bazel-out/host/bin/testvgyDnT/worker '--poison_after=1' @bazel-out/k8-fastbuild/bin/testvgyDnT/hello_world_2_worker_input)
ERROR: A crash occurred while bazel was trying to handle a crash! Please file a bug against bazel and include the information below.
Original uncaught exception:
java.lang.NullPointerException
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.waitResponse(WorkerMultiplexer.java:198)
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.run(WorkerMultiplexer.java:207)
Exception encountered during BugReport#handleCrash:
java.lang.NullPointerException
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.waitResponse(WorkerMultiplexer.java:198)
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.run(WorkerMultiplexer.java:207)

Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/sandbox/linux-sandbox/3802/execroot/io_bazel/_tmp/68faad8d76e456691f290fd9eb93d151/root/2472b9ba977ce049a016a5b3467b355c/server/jvm.out')
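
A hypothetical illustration of this failure mode (not the actual WorkerMultiplexer code): if the reader thread looks up a per-request semaphore that was never registered, or was already removed, the map lookup returns null and the following call throws. The field name responseChecker below is an assumption for illustration only.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class WaitResponseSketch {
  // One semaphore per in-flight request id (hypothetical layout).
  private final Map<Integer, Semaphore> responseChecker = new ConcurrentHashMap<>();

  void signalResponse(int requestId) {
    // Throws NullPointerException if requestId has no entry in the map.
    responseChecker.get(requestId).release();
  }

  // Defensive variant: tolerate a missing entry instead of crashing.
  void signalResponseSafe(int requestId) {
    Semaphore waiter = responseChecker.get(requestId);
    if (waiter != null) {
      waiter.release();
    }
  }
}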

@bazel-io bazel-io closed this in 6d1b972 Oct 14, 2019
@borkaehw
Contributor Author

@philwo Thank you so much. I am happy to address related issues in the future.

I believe it's still an optional feature. Should I draft a doc on how to turn it on?

@philwo
Member

philwo commented Oct 15, 2019

@borkaehw Documentation is always welcome, so yes, thank you :)

@borkaehw
Contributor Author

I am not sure about the steps here. Where and how can I contribute to the documentation?

@philwo
Member

philwo commented Oct 21, 2019

@borkaehw I think you can create a new Markdown file, e.g. persistent-workers.md, here:
https://github.com/bazelbuild/bazel/blob/master/site/docs/

We can then find out how to wire it up so that it shows in the menu - I'm also not 100% sure how this currently works. ;)

Maybe a good example would be this one:
https://github.com/bazelbuild/bazel/blob/master/site/docs/build-event-protocol.md

Here's how it looks on the website:
https://docs.bazel.build/versions/master/build-event-protocol.html

@philwo
Member

philwo commented Oct 21, 2019

It seems like the bazel_worker_multiplexer_test is still flaky, at least this one times out regularly in our internal Google CI: "test_multiple_target_with_delay".

Example log:

[14 / 15] Working on hello_world_3; 4s worker
[14 / 15] Working on hello_world_3; 15s worker
[14 / 15] Working on hello_world_3; 35s worker
[14 / 15] Working on hello_world_3; 50s worker
[14 / 15] Working on hello_world_3; 83s worker
[14 / 15] Working on hello_world_3; 105s worker
[14 / 15] Working on hello_world_3; 130s worker
[14 / 15] Working on hello_world_3; 158s worker
[14 / 15] Working on hello_world_3; 191s worker
[14 / 15] Working on hello_world_3; 229s worker
[14 / 15] Working on hello_world_3; 273s worker
[14 / 15] Working on hello_world_3; 323s worker
[14 / 15] Working on hello_world_3; 380s worker
[14 / 15] Working on hello_world_3; 447s worker
[14 / 15] Working on hello_world_3; 523s worker
[14 / 15] Working on hello_world_3; 610s worker

Blaze caught terminate signal; shutting down.

------------------------------------------------------------------------
test_multiple_target_with_delay FAILED: terminated by signal TERM.
third_party/bazel/src/test/shell/integration/bazel_worker_multiplexer_test:549: in call to main

@borkaehw
Contributor Author

@philwo I see. It seems like the response message got lost, so the worker was waiting for a response indefinitely. Do you have any idea how this could happen? Do we need to disable more tests?
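
One way to turn such a silent hang into a diagnosable failure (a sketch, under the assumption that the proxy waits on a per-request semaphore; not the actual Bazel code) is to wait with a timeout instead of indefinitely:

import java.io.IOException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class ResponseWaitSketch {
  // Returns normally once the response signal arrives, but gives up after
  // the deadline instead of blocking forever if the message was lost.
  static void awaitResponse(Semaphore responseArrived, long timeoutSeconds)
      throws InterruptedException, IOException {
    if (!responseArrived.tryAcquire(timeoutSeconds, TimeUnit.SECONDS)) {
      throw new IOException("no response from worker within " + timeoutSeconds + "s");
    }
  }
}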

@borkaehw
Contributor Author

@philwo The documentation PR has been opened: #10108. Thanks.

bazel-io pushed a commit that referenced this pull request Nov 12, 2019
Documentation for Multiplex Workers. See the original PR for details: #6857

Fix: #2832

Closes #10108.

PiperOrigin-RevId: 279950386
@borkaehw borkaehw deleted the multiplexer branch December 20, 2019 18:04
@larsrc-google
Contributor

Hi borkaehw, I recently joined the Bazel team and I'm picking up work on workers. I did a simple multi-threaded version of our internal JavaBuilder, and was surprised to find it performed worse in my benchmarks than having multiple separate workers. Did you get the hoped-for speed/RAM usage improvement out of your multiplex workers?

@borkaehw
Contributor Author

We expect it to use less RAM, since it should spawn fewer JVMs, while running at almost the same speed. We implemented it to address Bazel's massive memory usage. How much worse did it perform?

@larsrc-google
Contributor

larsrc-google commented Jul 28, 2020 via email

@SrodriguezO
Contributor

Hey @larsrc-google,

This sounds concerning, but we are unable to reproduce the behavior you're observing. We tried building our Scala services with and without multiplex workers, and we noticed a ~10% performance improvement when we use multiplex workers (without caching, across 3 trials).

We used the following higherkindness/rules_scala commits for these benchmarks:

  • multiplex workers enabled: 57949194155958241ffd3aaff0f275b3fee5ee62
  • multiplex workers disabled: 8674ae18f3e0be846a5e68d34d152da412c27529

The only difference is the multiplex worker revert in 8674ae18f3e0be846a5e68d34d152da412c27529.

We ran these against our own codebase, so we cannot share the source we built. Do you have a sample and/or OSS repo you're using for your investigation that we could check out?

@larsrc-google
Contributor

I believe my commit 7be7aed fixes the problem I was seeing: having a request id of 0 made that request blocking, so no more multiplex requests could be started until that one finished. Depending on the distribution of requests, that could make things a lot slower.
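
A hypothetical sketch of the dispatch behavior described above (illustrative only, not the actual change in 7be7aed):

class DispatchSketch {
  private final Object lock = new Object();

  void dispatch(int requestId) {
    if (requestId == 0) {
      // Singleplex path: hold the lock for the whole round trip, so no
      // other request can be dispatched until this one finishes.
      synchronized (lock) {
        sendAndAwaitResponse(requestId);
      }
    } else {
      // Multiplex path: send now; the response is matched by id later.
      sendAsync(requestId);
    }
  }

  private void sendAndAwaitResponse(int requestId) { /* blocking round trip */ }

  private void sendAsync(int requestId) { /* non-blocking send */ }
}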

@SrodriguezO
Contributor

Interesting. We don't have that commit on the version of Bazel we're using (3.3), but we still couldn't reproduce the issue. I tried patching Bazel 3.3 with your change to see if we noticed any performance improvements, but performance is the same as before with multiplex workers enabled 🤔

@larsrc-google
Contributor

It's possible that you happened to have other workers run before the multiplex workers, which would make the request id never be 0 for a multiplex worker. The request id counter is shared between all kinds of workers, so that's quite possible.
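
A minimal sketch of that sharing (the counter below is an assumed shape, not the actual Bazel field):

import java.util.concurrent.atomic.AtomicInteger;

class RequestIdSketch {
  // One counter shared by singleplex and multiplex workers alike: whichever
  // worker sends the very first request of the build gets id 0. If ordinary
  // workers run first, a multiplex worker never sees request id 0.
  private static final AtomicInteger REQUEST_ID = new AtomicInteger(0);

  static int nextRequestId() {
    return REQUEST_ID.getAndIncrement();
  }
}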

@larsrc-google
Contributor

BTW, it looks like the flakes mentioned by @philwo aren't happening any more.

@larsrc-google
Contributor

@borkaehw: I've been poring over the code a lot for refactoring and integrating with dynamic execution, and I have two questions of a Chesterton's Fence nature:

  1. Was there a specific reason you made WorkerMultiplexer.getResponse() return an InputStream instead of the already-parsed WorkResponse, or was it just an implementation accident?

  2. Why did you use regular maps with semaphores around them rather than ConcurrentHashMap or similar?

@borkaehw
Contributor Author

  1. It might simply be because the regular worker also parses the InputStream in Worker.java (trying to be consistent), and there is an extra null check before parsing in WorkerProxy.java (which feels like a good place to check):
if (inputStream == null) {
    return null;
}
return WorkResponse.parseDelimitedFrom(inputStream);

But I don't have a strong opinion that it should stay this way. Feel free to make any improvements.

  2. I didn't think too much about that; ConcurrentHashMap looks like a better alternative. Feel free to change it.
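
For comparison, a sketch of the two approaches (illustrative only, not the actual PR code):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class MapGuardSketch {
  // Approach in this PR: a plain map guarded by a binary semaphore.
  private final Map<Integer, String> guarded = new HashMap<>();
  private final Semaphore mapLock = new Semaphore(1);

  void putGuarded(int key, String value) throws InterruptedException {
    mapLock.acquire();
    try {
      guarded.put(key, value);
    } finally {
      mapLock.release();
    }
  }

  // Alternative: ConcurrentHashMap does the synchronization internally,
  // so callers need no explicit lock for single-key operations.
  private final Map<Integer, String> concurrent = new ConcurrentHashMap<>();

  void putConcurrent(int key, String value) {
    concurrent.put(key, value);
  }
}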

Thanks for reaching out!

@larsrc-google
Contributor

Thank you. As you can see in the current WorkerProxy, changing away from passing an InputStream simplified the code a lot. I'll have to consider more carefully the impact of using ConcurrentHashMap, especially together with dynamic execution.
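
A sketch of that simplification, with illustrative names (WorkResponse is the worker protocol proto and parseDelimitedFrom its standard protobuf parse call; getResponseStream is a hypothetical helper):

import com.google.devtools.build.lib.worker.WorkerProtocol.WorkResponse;
import java.io.IOException;
import java.io.InputStream;

class GetResponseSketch {
  // Before: hand each caller a raw stream to null-check and parse itself.
  InputStream getResponseStream(int requestId) {
    /* ... look up the recorded response stream for this request id ... */
    return null;
  }

  // After: parse once inside the multiplexer and return the proto, so
  // WorkerProxy never touches the stream.
  WorkResponse getResponse(int requestId) throws IOException {
    InputStream inputStream = getResponseStream(requestId);
    if (inputStream == null) {
      return null;
    }
    return WorkResponse.parseDelimitedFrom(inputStream);
  }
}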
