Send and receive work requests via proxy and multiplexer #6857
Conversation
Force-pushed from 33afd6b to ecc8854
Friendly ping here.
Friendly ping on this @philwo
Ping @philwo
Is the review SLA like 2 months or something? It is really disheartening to watch these PRs go by with the only traffic being "ping, ping, ping..."
Sorry for the delay. I was on parental leave until January 2nd, came back to the office and found myself firefighting a broken Buildkite CI, which I had to fix. This took until today. Dear people - if you don't get a response from someone, maybe ping someone else? It's not like we don't want to review PRs, but sometimes the assignee just can't.
@philwo I was under the impression that you are the reviewer for this particular design and implementation, and it was not clear whether you wanted to hand this review over to someone else. This brings up a discussion point about setting expectations w.r.t. OOO. Sometimes people are OOO / on leave, but GitHub didn't have a good way to publicly announce that until recently.
@jin Yes, the problem is not that I didn't want to review this, the problem was that I couldn't do so and wasn't even aware of the pings until a few minutes ago due to some wrong settings on my side. I'm really sorry for that. In this particular example, a couple of things went wrong, but luckily it's easy to do better:
@jin assigned this to @philwo on 12/10. I assumed he had the context, since you are both at Google. If two Googlers have trouble keeping that state, how can external people know that we should have timed out and somehow find a different Googler to ping? Can I suggest that you put someone on triage for PRs, maybe as a rotating position (even weekly), to make sure PRs have been replied to, are properly assigned, and have some ETA if there will be significant waits? (In this case the reviewer was on paternity leave, so we could have been told there was an OOO situation that would delay this for 1-2 months.)
@philwo those are terrific suggestions, and we should definitely implement them. I'll bring this up in the next DevEx gardener meeting. I also want to apologize for not reaching out to you on other channels, or asking around to see if there was anyone else available to review this. We can also augment the design review process to always require two reviewers (one primary, one secondary), instead of one lead reviewer. cc @laurentlb
@johnynek actually, this is happening right now, but we're still figuring out a good set of SLOs for issue and PR responses. We shouldn't bottleneck reviews on a single person, as @philwo suggested.
@jin Thanks so much! Let's discuss some ideas and action items in the meeting. I'm happy to volunteer to write up a doc with what we come up with; I guess that's the least I can do after my embarrassing mess-up with the Gmail filters hiding all these notification e-mails 😔 Then we can distribute the knowledge about working efficiently with GitHub in each of our offices and follow up on the action items so that this doesn't happen again... WDYT?
@philwo that works, let's take this offline from this thread :-) thanks!
We updated the Contributing page a month ago; it now says:
As mentioned above, we should improve the process so that this doesn't happen anymore. But if you have any other bad experiences, it's useful to report them. Even a week of waiting with no news is way too long.
Force push was just rebasing on master
Thank you! I don't know about @jmmv's availability, but I'll probably have time to take a look at this on Monday next week.
The NullPointerException issue has been fixed.
Thank you, @borkaehw! I reviewed the code today and will try to import it when I'm back in the office.
There's one test failure that I get repeatedly on presubmits. I'm going to disable that test:
@philwo Thank you so much. I am happy to address related issues in the future. I believe it's still an optional feature; should I draft a doc on how to turn it on?
@borkaehw Documentation is always welcome, so yes, thank you :)
I am not sure about the steps here. Where and how can I contribute to the documentation?
@borkaehw I think you can create a new Markdown file, e.g. We can then find out how to wire it up so that it shows in the menu - I'm also not 100% sure how this currently works. ;)
Maybe a good example would be this one:
Here's how it looks on the website:
It seems like the bazel_worker_multiplexer_test is still flaky, at least this one times out regularly in our internal Google CI: "test_multiple_target_with_delay". Example log:
@philwo I see. It seems like the response message got lost, so the worker was waiting for a response indefinitely. Do you have any idea how this could happen? Do we need to disable more tests?
Hi borkaehw, I recently joined the Bazel team and I'm picking up work on workers. I did a simple multi-threaded version of our internal JavaBuilder, and was surprised to find it performed worse in my benchmarks than having multiple separate workers. Did you get the hoped-for speed/RAM usage improvement out of your multiplex workers?
We expect it to use less RAM since it should spawn fewer JVMs, and it should have almost the same speed. We implemented it to address Bazel's massive memory usage. How much worse did it perform?
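For context on the RAM argument: a multiplex worker keeps a single tool process alive and feeds it many concurrent requests, instead of Bazel starting one persistent JVM per parallel action. Below is a minimal worker-side sketch, assuming the standard length-delimited WorkRequest/WorkResponse protos from Bazel's worker protocol; the class name, thread pool, and doWork placeholder are illustrative, not the actual JavaBuilder or rules_scala worker.

```java
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkRequest;
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkResponse;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative multiplex worker: one JVM serving many in-flight requests. */
public final class ExampleMultiplexWorker {
  public static void main(String[] args) throws IOException {
    ExecutorService pool = Executors.newCachedThreadPool();
    while (true) {
      // Bazel writes length-delimited WorkRequest protos to the worker's stdin.
      WorkRequest request = WorkRequest.parseDelimitedFrom(System.in);
      if (request == null) {
        break; // stdin closed: Bazel is shutting the worker down
      }
      pool.execute(() -> handle(request));
    }
    pool.shutdown();
  }

  private static void handle(WorkRequest request) {
    int exitCode = doWork(request.getArgumentsList());
    WorkResponse response =
        WorkResponse.newBuilder()
            .setRequestId(request.getRequestId()) // lets Bazel match response to request
            .setExitCode(exitCode)
            .build();
    // Responses from different threads must not interleave on stdout.
    synchronized (System.out) {
      try {
        response.writeDelimitedTo(System.out);
        System.out.flush();
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  }

  private static int doWork(List<String> arguments) {
    return 0; // hypothetical placeholder for the real compile step
  }
}
```

The memory saving comes from amortizing one JVM (heap, JIT state, loaded classes) across all in-flight actions; the flip side is that the tool itself must be thread-safe, which is where contention-related slowdowns can come from.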
I saw slowdowns of a factor of 2, more if I used many multiplex workers. But I am starting to find out why. -Lars
Hey @larsrc-google, this sounds concerning, but we are unable to reproduce the behavior you're observing. We tried building our Scala services with and without multiplex workers, and we noticed a ~10% performance improvement when we use multiplex workers (without caching and across 3 trials). We used the following higherkindness/rules_scala commits for these benchmarks:
The only difference is the multiplex worker revert in 8674ae18f3e0be846a5e68d34d152da412c27529. We ran these against our own codebase, so we cannot share the source we built. Do you have a sample and/or OSS repo you're using for your investigation that we could check out?
I believe my commit 7be7aed fixes the problem I was seeing - having a request id of 0 made that request blocking, so no more multiplex requests could be started until that one finished. Depending on the distribution of requests, that could make it a lot slower.
Interesting. We don't have that commit on the version of Bazel we're using (3.3), but we still couldn't reproduce the issue. I tried patching Bazel 3.3 with your change to see if we noticed any performance improvements, but performance is the same as before with multiplex workers enabled 🤔
It's possible that you happened to have other workers run before the multiplex workers, which would make the request id never be 0 for a multiplex worker. The request id counter is shared between all kinds of workers, so that's quite possible.
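A hypothetical sketch of the failure mode being described, not the PR's actual code: if the request id comes from a counter shared by all worker kinds, and an id of 0 is treated as a legacy blocking request, then only the very first id Bazel hands out can ever hit the slow path, which would also explain why running any other worker first hides the problem.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of the request-id-0 pitfall discussed above. */
final class RequestIdExample {
  // One counter shared by singleplex and multiplex workers, so an earlier
  // non-multiplex action can push it past 0 before any multiplex worker runs.
  private static final AtomicInteger NEXT_REQUEST_ID = new AtomicInteger(0);

  static int nextId() {
    return NEXT_REQUEST_ID.getAndIncrement();
  }

  static void dispatch(int requestId, Runnable work) {
    if (requestId == 0) {
      // Treated like a classic non-multiplexed request: handled inline, which
      // blocks the multiplexer until it finishes and stalls other requests.
      work.run();
    } else {
      // Multiplexed path: other requests keep flowing while this one runs.
      new Thread(work).start();
    }
  }
}
```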
BTW, it looks like the flakes mentioned by @philwo aren't happening anymore.
@borkaehw: I've been poring over the code a lot for refactoring and integrating with dynamic execution, and I have two questions of a Chesterton's Fence nature:
But I don't have a strong opinion that it has to stay the way it is. Feel free to make any improvements.
Thanks for reaching out!
Thank you. As you can see in the current WorkerProxy, changing away from passing an InputStream simplified the code a lot. I'll have to consider the impact of using a ConcurrentHashMap more carefully, especially together with dynamic execution.
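For reference, the bookkeeping pattern being discussed looks roughly like this: instead of handing each proxy a live InputStream, park a per-request placeholder in a ConcurrentHashMap keyed by request id and complete it when the multiplexer reads the matching response. A rough sketch with illustrative names, not the actual WorkerProxy/WorkerMultiplexer code:

```java
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkResponse;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative pending-response table shared by the proxies and the multiplexer. */
final class PendingResponses {
  private final ConcurrentHashMap<Integer, CompletableFuture<WorkResponse>> pending =
      new ConcurrentHashMap<>();

  /** Called by a proxy right before it sends the request with this id. */
  CompletableFuture<WorkResponse> expect(int requestId) {
    CompletableFuture<WorkResponse> future = new CompletableFuture<>();
    pending.put(requestId, future);
    return future;
  }

  /** Called by the multiplexer's reader thread when a response arrives. */
  void complete(WorkResponse response) {
    CompletableFuture<WorkResponse> future = pending.remove(response.getRequestId());
    if (future != null) {
      future.complete(response);
    }
    // A null entry would mean a response for an unknown or already-cancelled
    // request; under dynamic execution this is exactly where extra care is needed.
  }
}
```

A proxy then blocks on its future, ideally with a timeout, so that a lost response surfaces as an error instead of the indefinite hang mentioned earlier in the thread.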
This is an attempt to solve issue #2832.
The design doc, Multiplex persistent worker, has been approved.
Two minor design changes from the design doc:
- --worker_max_instances
TODOs:
- WorkerMultiplexer.java run()
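As a reading aid for the design above: the multiplexer side can be pictured as one reader thread per shared worker process that pulls length-delimited WorkResponse protos off that process's stdout and routes each one to the waiting proxy by request id. A hedged sketch reusing the hypothetical PendingResponses helper sketched earlier, not the PR's actual WorkerMultiplexer.run():

```java
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkResponse;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative reader loop for one shared multiplex worker process. */
final class ExampleMultiplexerLoop implements Runnable {
  private final InputStream workerStdout; // stdout of the shared worker process
  private final PendingResponses pending; // hypothetical helper sketched earlier

  ExampleMultiplexerLoop(InputStream workerStdout, PendingResponses pending) {
    this.workerStdout = workerStdout;
    this.pending = pending;
  }

  @Override
  public void run() {
    try {
      while (true) {
        WorkResponse response = WorkResponse.parseDelimitedFrom(workerStdout);
        if (response == null) {
          break; // worker process exited or closed its stdout
        }
        pending.complete(response); // wake up the proxy waiting on this request id
      }
    } catch (IOException e) {
      // A real implementation should fail all pending requests here rather than
      // leaving proxies waiting forever (cf. the lost-response hang reported above).
    }
  }
}
```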