Multiplex persistent worker protocol #2832

Closed
jart opened this issue Apr 16, 2017 · 32 comments
Assignees
Labels
category: sandboxing P2 We'll consider working on this in future. (Assignee optional) type: feature request

Comments

@jart
Contributor

jart commented Apr 16, 2017

Background

Bazel spawns 4 persistent worker processes and sends them requests serially via stdin/stdout.

Requirements

  • Option to spawn 1 process instead of 4 in ctx.action execution_requirements.
  • Ability to handle multiple requests simultaneously

Justification

  • JVM has high memory overhead.
  • CacheBuilder and SoftReference turn Java GC into super fast LRU cache for ASTs.
  • Why have 4 caches?
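
A minimal sketch of the shared-cache argument, assuming a hypothetical `parse` callable and a plain dict standing in for the CacheBuilder/SoftReference cache (no eviction is modeled here):

```python
import threading


class SharedAstCache:
    """One cache shared by every request handled in a single worker process.

    `parse` is a hypothetical stand-in for whatever produces an AST.
    """

    def __init__(self, parse):
        self._parse = parse
        self._lock = threading.Lock()
        self._entries = {}
        self.misses = 0  # how many times parse() actually ran

    def get(self, path):
        # Holding the lock while parsing keeps the sketch simple: a file is
        # parsed at most once, at the cost of serializing cache misses.
        with self._lock:
            if path not in self._entries:
                self.misses += 1
                self._entries[path] = self._parse(path)
            return self._entries[path]
```

With four concurrent requests for the same file in one process, `parse` runs once. With four separate worker processes, each process would hold its own copy of the entry and pay for its own miss.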

Design No. 1: Multiplex

  • Continue using stdin/stdout
  • Add request_id field to worker protocol request and response message protos
  • Add is_trying boolean field to response proto, sort of like 100 Trying in SIP
  • Bazel can send another request in parallel if it gets an is_trying response
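
A rough sketch of how the Bazel side might sort an interleaved response stream under Design No. 1. Plain dicts stand in for the response protos; `request_id` and `is_trying` are the fields proposed above, while `output` and the function name are assumptions made for illustration:

```python
def collect_responses(stream):
    """Split an interleaved response stream by request_id.

    Returns (final, acknowledged): `final` maps request_id to the output of
    its final response, and `acknowledged` holds the ids for which the worker
    sent an interim is_trying response.
    """
    final = {}
    acknowledged = set()
    for response in stream:
        request_id = response["request_id"]
        if response.get("is_trying"):
            # Interim acknowledgement: the worker has accepted the request,
            # so the caller may send another one in parallel.
            acknowledged.add(request_id)
        else:
            final[request_id] = response["output"]
    return final, acknowledged
```

Because every message carries its `request_id`, final responses can arrive in any order without being attributed to the wrong request.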

Design No. 2: TCP

  • Add upgrade field to response proto that redirects Bazel to a HostAndPort.
  • All future requests get sent there via TCP
  • Don't use gRPC; just send the raw protos
  • Maybe allow multiple requests per socket

CC: @lberki, @meisterT, with whom I socialized this idea offline, IIRC

@buchgr
Contributor

buchgr commented Apr 18, 2017

Don't use gRPC; just send the raw protos
Maybe allow multiple requests per socket

Those two seem contradictory. That is, if you want multiple requests per socket, you again need to implement something like gRPC.
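
To make the concern concrete: sharing one socket between requests requires at least message framing plus an id to tell requests apart. A minimal sketch, assuming a hypothetical fixed header of two big-endian 32-bit integers (this is not the worker protocol's actual wire format):

```python
import io
import struct

# Hypothetical header: request_id and payload length, both unsigned 32-bit.
HEADER = struct.Struct(">II")


def write_frame(stream, request_id, payload):
    """Write one framed message: header followed by the raw payload bytes."""
    stream.write(HEADER.pack(request_id, len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Read one framed message; return None on a clean end of stream."""
    header = stream.read(HEADER.size)
    if not header:
        return None
    if len(header) < HEADER.size:
        raise EOFError("truncated header")
    request_id, length = HEADER.unpack(header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("truncated payload")
    return request_id, payload
```

The sketch assumes a buffered binary stream, such as `io.BytesIO` or the object returned by `socket.makefile("rb")`, whose `read(n)` returns fewer than `n` bytes only at end of stream. Flow control, cancellation, and error reporting are left out, and those are the parts gRPC would otherwise provide.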

@lberki
Contributor

lberki commented Apr 18, 2017

Do you have data on the costs of not implementing this? Sure, it can be done, but it's yet another feature we have to support, so I'd rather it pays its rent.

@abergmeier-dsfishlabs
Contributor

Why not have a frontend-process, which delegates to one background-process? Why does it have to be implemented on the Bazel level?

@philwo
Member

philwo commented Apr 18, 2017

Fun fact, a multiplexing TCP-based version of this was implemented in the very first version of persistent workers and worked perfectly fine with a multi-threaded version of the JavaBuilder worker - but was then deemed unnecessary complexity by me and teammates and the code was deleted (AFAIR without even submitting the CL, so we can't restore it from history, ouch) and replaced with the simpler, serialized, multi-process stdin/stdout mechanism. :| Maybe we should have gone with the more complex version in the first place. Hindsight is best sight.

I'll have a look at this! Thanks for writing this proposal down so cleanly.

philwo added the P2 We'll consider working on this in future. (Assignee optional) label Apr 18, 2017
@jart
Contributor Author

jart commented Apr 18, 2017

@philwo My pleasure. Did you consider using the multiplexing technique described in Design No. 1? That would avoid the socket complexity and should hopefully be pretty straightforward. Users could continue doing things the simple way if they want.

@lberki It's far too easy to accidentally link the wrong thing in our internal repo and end up with so many jars that the JVM takes up gigs of memory. The JVM is amazing at threads and garbage collection so it makes sense to me to utilize those strengths, just like Bazel does.

@abergmeier-dsfishlabs The requests would still get sent to the frontend in parallel. Maybe the four frontends could scheme together to launch a single backend, possibly by locking a single input file, but I'm not sure if Bazel would consider that hermetic.

@pauldraper
Contributor

pauldraper commented Sep 4, 2018

Why not have a frontend-process, which delegates to one background-process? Why does it have to be implemented on the Bazel level?

Because that would be really easy to mess up. Who would start it? Who would stop it? Who would check to see if files have changed?

Do you have data on the costs of not implementing this? Sure, it can be done, but it's yet another feature we have to support, so I'd rather it pays its rent.

Lucid Software has seen a massive memory regression in transitioning from sbt (Scala) to Bazel. sbt used a single JVM process, so it could JIT the Scala compiler once and have reasonable memory overhead. Then Bazel says, "Hey, if you want the same performance you had before, take your machine apart, cram a bunch more RAM into it, and start 8x the number of processes, each doing the exact same JIT."

As @jart said, it's insane to have multiple local caches. And the only reason to use workers is to cache things, no? (Typically caching JIT, sometimes just caching loading the executable, and I suppose you can get fancier.) Is there any situation that wouldn't use significantly less memory with this proposal?

@buchgr
Contributor

buchgr commented Sep 4, 2018

Is there any situation that wouldn't use significantly less memory with this proposal?

For compilers that don't support multi-threaded compilation, it should be neither a win nor a loss. I remember talking to @philwo offline, and I believe we agreed that it's a good idea to support your use case. Would you be interested in working on this @pauldraper ?

@pauldraper
Contributor

For compilers that don't support multi-threaded compilation

That's true. Say, Node.js-based compilers.

Would you be interested in working on this @pauldraper ?

I'm no longer working with this, but @jjudd may be interested.

@pauldraper
Contributor

pauldraper commented Sep 4, 2018

My 2c though: I suppose TCP would help the #4897 issues. But I like the simplicity of stdin/stdout (even when multiplexed). And not fiddling with Nagle, etc.

FWIW, the apt transport protocol is multiplexed stdin/stdout with an executable.

@jjudd
Contributor

jjudd commented Sep 4, 2018

We are definitely interested in this. Launching lots of JVMs consumes lots of resources.

I'm not sure when we will have time to work on it, but it is something we are interested in.

@buchgr do you have a ballpark estimate of how large of an effort you think developing this would be? I lack enough context on the Bazel codebase to effectively tell if it is a 2 day, 2 week, or 2 month effort.

@buchgr
Contributor

buchgr commented Sep 4, 2018

@buchgr do you have a ballpark estimate of how large of an effort you think developing this would be? I lack enough context on the Bazel codebase to effectively tell if it is a 2 day, 2 week, or 2 month effort.

As the original author of the workers feature @philwo should be able to answer this question best and provide guidance.

@jin
Member

jin commented Sep 4, 2018

/sub

@jjudd
Contributor

jjudd commented Sep 17, 2018

@philwo friendly ping. In your opinion how large of a task is this? Hours, days, weeks, months?

@philwo
Member

philwo commented Sep 27, 2018

I think it shouldn't take long - days for a first prototype, weeks for a first complete version maybe? This would only concern the Bazel side though, I can't speak about updating existing workers to take advantage of the new protocol.

All the worker related code in Bazel is concentrated here: https://source.bazel.build/bazel/+/master:src/main/java/com/google/devtools/build/lib/worker/ - so you don't need much context about how Bazel works.

There's an integration test, too: https://source.bazel.build/bazel/+/master:src/test/shell/integration/bazel_worker_test.sh

Regarding the protocol, I'm open to whatever you'd come up with that works well and is easily integrated into various languages out there. I think I've seen persistent workers written in Java, JavaScript, TypeScript, Dart so far.

@cushon (Java), @mprobst (TypeScript) and @davidmorgan (Dart) might want to comment on this with their ideas / wishes. :)

@davidmorgan

From the Dart side: parallel requests in one worker aren't super exciting, since Dart is single-threaded. We're planning on experimenting with build performance in Q4, and we might have some suggestions for worker protocol changes of our own. (Not super high probability, though; 20% maybe.)

@mprobst
Contributor

mprobst commented Sep 27, 2018 via email

@cushon
Contributor

cushon commented Sep 27, 2018

@kevin1e100 might be interested in this for kotlin.

Javac is single-threaded, but I think this would allow us to run multiple instances of it in one worker and share a cache and memory footprint, instead of having multiple workers, which starts to use a lot of memory and means any cache takes a long time to warm up. How do Dart and TypeScript avoid those issues with the current approach?

@kevin1e100
Contributor

Right, even with single-threaded underlying tools, if they can safely be run in parallel, that can still be a win, I would think. But from my point of view this really shines when the worker wants to do some kind of caching (example below) or an incremental scheme (e.g., Java compilation is typically incremental in the Eclipse IDE, IIUC). Bazel's DexBuilder worker for Android builds, for instance, uses caching, but as @cushon mentioned, all worker instances have their own cache, which can be unfortunate.

@jjudd
Contributor

jjudd commented Sep 27, 2018

Thanks for the estimate @philwo. We are starting work on this. @borkaehw is leading the implementation on our end. We'll keep people updated as we make progress, propose designs, etc.

@buchgr
Contributor

buchgr commented Sep 28, 2018

@jjudd it would be great if you could share a design document with bazel-discuss / bazel-dev before doing the implementation. We are happy to review it and give pointers! Thanks so much!

@davidmorgan

@cushon Users of bazel+workers+Dart are Google-internal, so we just use a lot of RAM.

@ittaiz
Member

ittaiz commented Sep 28, 2018 via email

@cushon
Contributor

cushon commented Sep 28, 2018

@davidmorgan are you using the workers for caching/incrementality, or mostly to keep a VM warm? If you're using caching, have you seen issues with the hit rate from having a separate cache for each worker instance?

@davidmorgan

@cushon Right now mostly to keep a VM warm. We hope to gain more from caching/incrementality in future.

borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 1, 2019
For each unique WorkerKey, Bazel can optionally launch a multiplexer to talk to one multi-threaded worker process. This uses fewer JVM processes while maintaining approximately the same performance, and therefore saves memory. The worker process must be able to handle multiple requests concurrently to take full advantage of this feature.

Fix: bazelbuild#2832
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 3, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 3, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 4, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 4, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 4, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 5, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 5, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 11, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 15, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 22, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 26, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Apr 29, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue May 7, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue May 17, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue May 30, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue May 30, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Aug 6, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Aug 6, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Aug 6, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Aug 21, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Aug 21, 2019
SrodriguezO pushed a commit to lucidsoftware/bazel that referenced this issue Sep 4, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Sep 12, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Sep 12, 2019
borkaehw added a commit to lucidsoftware/bazel that referenced this issue Sep 12, 2019
bazel-io pushed a commit that referenced this issue Oct 14, 2019
This is the attempt to solve issue #2832.
The design doc has been approved [Multiplex persistent worker](https://docs.google.com/document/d/1OC0cVj1Y1QYo6n-wlK6mIwwG7xS2BJX7fuvf1r5LtnU/edit?usp=sharing).

Two minor design changes from design doc
- Number of WorkerProxy is still limited by `--worker_max_instances`.
- We merge the worker multiplexer sender and receiver into one WorkerMultiplexer; WorkerProxy sends requests to the worker process directly.

Closes #6857.

PiperOrigin-RevId: 274560006
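
The arrangement described in this commit message can be pictured roughly as follows. This is an illustrative sketch only, not Bazel's Java implementation: queues stand in for the worker's stdin/stdout, and the class name, method names, and dict fields are all assumptions.

```python
import queue
import threading


class Multiplexer:
    """Shares one worker between many callers and routes responses by id."""

    def __init__(self, to_worker, from_worker):
        self._to_worker = to_worker      # stands in for the worker's stdin
        self._from_worker = from_worker  # stands in for the worker's stdout
        self._pending = {}               # request_id -> queue awaiting reply
        self._lock = threading.Lock()
        self._next_id = 0
        threading.Thread(target=self._receive, daemon=True).start()

    def call(self, arguments, timeout=5):
        """Send one request and block until its own response arrives."""
        slot = queue.Queue(maxsize=1)
        with self._lock:
            self._next_id += 1
            request_id = self._next_id
            self._pending[request_id] = slot
        # The caller writes to the worker directly, as WorkerProxy does above.
        self._to_worker.put({"request_id": request_id, "arguments": arguments})
        return slot.get(timeout=timeout)

    def _receive(self):
        # Single reader: hands each response to whichever caller is waiting.
        while True:
            response = self._from_worker.get()
            with self._lock:
                slot = self._pending.pop(response["request_id"])
            slot.put(response)


def fake_worker(requests, responses):
    """Hypothetical worker: upper-cases its arguments and echoes the id."""
    while True:
        request = requests.get()
        output = " ".join(request["arguments"]).upper()
        responses.put({"request_id": request["request_id"], "output": output})
```

Each caller plays the role of a proxy: it blocks in `call` until the single receiver thread delivers the response carrying its `request_id`, so several threads can share one worker without seeing each other's results.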