
Performance regression from 6.4.0 -> 7 #21828

Closed
joeljeske opened this issue Mar 27, 2024 · 14 comments
Labels
awaiting-user-response (Awaiting a response from the author)
P3 (We're not considering working on this, but happy to review a PR. (No assignee))
team-Remote-Exec (Issues and PRs for the Execution (Remote) team)
type: support / not a bug (process)

Comments

@joeljeske
Contributor

Description of the bug:

After bumping Bazel from 6.4.0 to 7.x, I've encountered significant performance regressions during my build. At the start of the build, Bazel progresses quickly and consumes all available --jobs with remote actions. Within ~45 minutes, Bazel is only able to execute 0 or 1 action remotely, and the rest are stuck in [Prepa] for a long time. I do not observe this behavior on 6.4.0, but I can consistently reproduce it in 7.0.1 and 7.1.1.

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Ubuntu 20.04

What is the output of bazel info release?

release 7.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

Potentially related to #21626

Any other information, logs, or outputs that you want to share?

2024-03-26 21:28:30 CDT	(02:28:30) [755,647 / 761,751] 16720 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2857s ... (200 actions, 0 running)
2024-03-26 21:29:58 CDT	(02:29:58) [756,389 / 762,222] 16768 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2945s ... (200 actions, 2 running)
2024-03-26 21:31:04 CDT	(02:31:04) [764,176 / 769,524] 17127 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3011s ... (200 actions, 1 running)
2024-03-26 21:32:04 CDT	(02:32:04) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3071s ... (200 actions, 0 running)
2024-03-26 21:33:35 CDT	(02:33:35) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3161s ... (200 actions, 0 running)

Some relevant flags

--jobs 200
--remote_download_minimal
--experimental_throttle_remote_action_building  # retained default value
--bes_backend=grpc://127.0.0.1:5555
--bes_upload_mode=FULLY_ASYNC
--remote_timeout=3600
--experimental_remote_merkle_tree_cache
--experimental_remote_merkle_tree_cache_size=1000
--show_progress_rate_limit=60

Thread state: threads.txt

github-actions bot added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team) label Mar 27, 2024
@tjgq
Contributor

tjgq commented Mar 27, 2024

--experimental_remote_merkle_tree_cache_size=1000 is potentially too small (see #18686). Can you bump that up to something like 10000 and see if the issue goes away? (Unfortunately it's difficult to determine a good value; that's one of several problems with the Merkle tree cache as explained at length in #21378.)
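For example, something along these lines (10000 is just a starting point to experiment with, not a value tuned for your build):

--experimental_remote_merkle_tree_cache
--experimental_remote_merkle_tree_cache_size=10000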

@joeljeske
Contributor Author

Understood, I will try that. From that discussion, should I potentially disable the merkle cache altogether?

Also related: I am still observing the memory leaks and OOMs tracked in #16913 on 6.4.0. I have been unable to get stable, successful builds on 7.x, so I can't yet determine whether the memory leak is resolved by your changes to Merkle tree caching.

@tjgq
Contributor

tjgq commented Mar 27, 2024

Understood, I will try that. From that discussion, should I potentially disable the merkle cache altogether?

In my experience, the cache is only helpful if your build is dominated by large tree artifacts (i.e., declare_directory) and runfiles trees; for everything else, it's a pessimization. So I'd definitely do some measurements to determine whether it helps or hinders.
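A simple A/B is to time two otherwise-identical clean builds, one with the cache and one without, and compare wall time (and the JSON profile, if you collect one). A rough sketch, substituting your real targets and sizing:

bazel clean
bazel build //... --experimental_remote_merkle_tree_cache --experimental_remote_merkle_tree_cache_size=10000
bazel clean
bazel build //... --noexperimental_remote_merkle_tree_cache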

@joeljeske
Contributor Author

We are very mixed; some areas of the code have small files and trees (golang, python, java), and others have large directories from node_modules, managed by rules_js.

I'll try with and without to see if there is a benefit. How much does --experimental_remote_merkle_tree_cache_size affect heap usage? Is there a fixed memory cost per cache entry? I'm cautious about bumping it up too high and causing OOMs.

@tjgq
Contributor

tjgq commented Mar 27, 2024

Unfortunately, there's no straightforward connection between the flag value and the amount of heap consumed; each cache entry corresponds to a depset node, and has size proportional to the number of direct elements of that node (roughly num_files * (length(basename) + length(digest) + overheads)). If one of the direct elements is a tree artifact or a runfiles tree, it's promoted to its own cache entry with the corresponding number of files.
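As a rough illustration (hypothetical numbers, not a measurement): a depset node with 1,000 direct files, ~30-character basenames, 32-byte digests, and on the order of 100 bytes of per-entry overhead would come to roughly 1,000 × (30 + 32 + 100) ≈ 160 KB for that one cache entry, while an entry for a node_modules tree artifact with 100,000 files would be closer to 16 MB.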

@joeljeske
Contributor Author

I can confirm that --noexperimental_remote_merkle_tree_cache allows the build to progress normally. Did the implementation of the Merkle tree caching change in 7.x? I do not see the same performance issue in 6.4.0, and I'm wondering whether a regression was introduced.

@tjgq
Contributor

tjgq commented Mar 27, 2024

Yes, I believe an unintentional regression was introduced in the lead-up to 7.x, causing a catastrophic slowdown when the cache size is too small to fit the largest depset in the build (in 6.x, a cache too small might have been slower, but not catastrophically so). As long as the size is large enough, I haven't seen evidence that 7.x is slower than 6.x.

Unfortunately, it's difficult to revert the culprit (it's actually a combination of two different changes) because some later work has come to depend on it, and per the discussion in #21378, we'd rather spend time rearchitecting the Merkle tree cache instead of addressing performance issues with the current implementation.

@joeljeske
Contributor Author

This makes complete sense, and now I understand why 6.x was sufficient, even if less optimal than it could have been. Thank you for explaining.

I will run some experiments on my end on 6.x and 7.x to evaluate the effectiveness of --experimental_remote_merkle_tree_cache when the cache size is adequate. Hopefully I won't end up objecting to your proposal to remove the cache altogether 😄

@oquenchil added the P3, type: support / not a bug (process), and awaiting-user-response labels and removed the type: bug and untriaged labels Apr 2, 2024
@tjgq
Contributor

tjgq commented Apr 2, 2024

@joeljeske I'm curious if you've had a chance to test out build performance with and without the Merkle tree cache? Knowing whether it helps in your case would be valuable input into our future plans.

@joeljeske
Contributor Author

Hey @tjgq thanks for reaching out. I have done some spot checks with the following flow: after a bazel clean I perform a build with remote execution where all remote actions are expected to be cached.
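Concretely, the flow is roughly the following (targets and remote-execution flags redacted; this is a sketch of the invocation, not the exact command):

bazel clean
bazel test //... --remote_download_minimal --jobs=200   # all remote actions expected to be cache hits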

Do you know if this flow is sufficient to evaluate the effectiveness of flipping --experimental_remote_merkle_tree_cache, or do I need to ensure actions are not cached remotely?

@tjgq
Contributor

tjgq commented Apr 2, 2024

It's fine if the actions are cached remotely, since in order to check for a cache hit we still need to construct the Merkle trees.

@joeljeske
Contributor Author

Excellent. So far, I have not observed a significant performance regression with this flag turned off. I will try flipping it for a portion of my fleet and watch for any regression.

With this off, should I be setting --experimental_remote_discard_merkle_trees?

@tjgq
Contributor

tjgq commented Apr 2, 2024

--experimental_remote_discard_merkle_trees is a separate optimization; I believe that one is generally a good tradeoff.

@joeljeske
Contributor Author

My performance issues in 7.x are resolved with --noexperimental_remote_merkle_tree_cache.

Flipping on --experimental_remote_discard_merkle_trees resulted in excessive memory pressure, which was unexpected, so I have it off currently.

I will close this now. I eagerly await any future optimizations to the Merkle tree calculation process that may come out of #21378 😄
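For anyone landing on this later, the relevant part of my configuration now looks roughly like this (a sketch, not the complete .bazelrc):

build --noexperimental_remote_merkle_tree_cache
# --experimental_remote_discard_merkle_trees left off for now due to the memory pressure mentioned above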
