
Performance regression from 6.4.0 -> 7 #21828

Closed
joeljeske opened this issue Mar 27, 2024 · 14 comments
Labels
awaiting-user-response (Awaiting a response from the author)
P3 (We're not considering working on this, but happy to review a PR. (No assignee))
team-Remote-Exec (Issues and PRs for the Execution (Remote) team)
type: support / not a bug (process)

Comments

@joeljeske
Contributor

Description of the bug:

After bumping Bazel from 6.4.0 to 7.x, I've encountered significant performance regressions during my build. At the start of the build, Bazel progresses quickly and consumes all available --jobs with remote actions. Within ~45 minutes, Bazel is only able to execute 0 or 1 action remotely, and the rest are stuck in [Prepa] for a long time. I do not observe this behavior on 6.4.0, but I can consistently reproduce it in 7.0.1 and 7.1.1.

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Ubuntu 20.04

What is the output of bazel info release?

release 7.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

Potentially related to #21626

Any other information, logs, or outputs that you want to share?

2024-03-26 21:28:30 CDT	(02:28:30) [755,647 / 761,751] 16720 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2857s ... (200 actions, 0 running)
2024-03-26 21:29:58 CDT	(02:29:58) [756,389 / 762,222] 16768 / 24275 tests, 1 failed; [Prepa] <redacted target>; 2945s ... (200 actions, 2 running)
2024-03-26 21:31:04 CDT	(02:31:04) [764,176 / 769,524] 17127 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3011s ... (200 actions, 1 running)
2024-03-26 21:32:04 CDT	(02:32:04) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3071s ... (200 actions, 0 running)
2024-03-26 21:33:35 CDT	(02:33:35) [766,330 / 771,249] 17205 / 24275 tests, 1 failed; [Prepa] <redacted target>; 3161s ... (200 actions, 0 running)

Some relevant flags

--jobs 200
--remote_download_minimal
--experimental_throttle_remote_action_building  # retained default value
--bes_backend=grpc://127.0.0.1:5555
--bes_upload_mode=FULLY_ASYNC
--remote_timeout=3600
--experimental_remote_merkle_tree_cache
--experimental_remote_merkle_tree_cache_size=1000
--show_progress_rate_limit=60

Thread state: threads.txt

github-actions bot added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team) label Mar 27, 2024
@tjgq
Contributor

tjgq commented Mar 27, 2024

--experimental_remote_merkle_tree_cache_size=1000 is potentially too small (see #18686). Can you bump that up to something like 10000 and see if the issue goes away? (Unfortunately it's difficult to determine a good value; that's one of several problems with the Merkle tree cache as explained at length in #21378.)
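For example, something along these lines (10000 is just a starting point to experiment with, not a value tuned for your build):

--experimental_remote_merkle_tree_cache
--experimental_remote_merkle_tree_cache_size=10000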

@joeljeske
Contributor Author

Understood, I will try that. From that discussion, should I potentially disable the merkle cache altogether?

Also related: I am still observing the memory leaks and OOMs tracked in #16913 on 6.4.0. I have been unable to get stable, successful builds on 7.x, so I can't yet determine whether the memory leak is resolved by your changes to Merkle tree caching.

@tjgq
Contributor

tjgq commented Mar 27, 2024

Understood, I will try that. From that discussion, should I potentially disable the merkle cache altogether?

In my experience, the cache is only helpful if your build is dominated by large tree artifacts (i.e., declare_directory) and runfiles trees; for everything else, it's a pessimization. So I'd definitely do some measurements to determine whether it helps or hinders.
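A simple A/B is to time two otherwise-identical clean builds, one with the cache and one without, and compare wall time (and the JSON profile, if you collect one). A rough sketch, substituting your real targets and sizing:

bazel clean
bazel build //... --experimental_remote_merkle_tree_cache --experimental_remote_merkle_tree_cache_size=10000
bazel clean
bazel build //... --noexperimental_remote_merkle_tree_cache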

@joeljeske
Contributor Author

We are very mixed; some areas of the code have small files and trees (golang, python, java), and others have large directories from node_modules, managed by rules_js.

I'll try with and without to see if there is a benefit. How much does --experimental_remote_merkle_tree_cache_size affect heap usage? Is there a fixed memory cost per cache entry? I'm cautious about bumping it up too high and causing OOMs.

@tjgq
Contributor

tjgq commented Mar 27, 2024

Unfortunately, there's no straightforward connection between the flag value and the amount of heap consumed; each cache entry corresponds to a depset node, and has size proportional to the number of direct elements of that node (roughly num_files * (length(basename) + length(digest) + overheads)). If one of the direct elements is a tree artifact or a runfiles tree, it's promoted to its own cache entry with the corresponding number of files.
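As a rough illustration (hypothetical numbers, not a measurement): a depset node with 1,000 direct files, ~30-character basenames, 32-byte digests, and on the order of 100 bytes of per-entry overhead would come to roughly 1,000 × (30 + 32 + 100) ≈ 160 KB for that one cache entry, while an entry for a node_modules tree artifact with 100,000 files would be closer to 16 MB.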

@joeljeske
Contributor Author

I can confirm that --noexperimental_remote_merkle_tree_cache allows the build to progress normally. Did the implementation of the Merkle tree caching change in 7.x? I do not see the same performance issue in 6.4.0, and I'm wondering whether a regression was introduced.

@tjgq
Contributor

tjgq commented Mar 27, 2024

Yes, I believe an unintentional regression was introduced in the lead-up to 7.x, causing a catastrophic slowdown when the cache size is too small to fit the largest depset in the build (in 6.x, a cache too small might have been slower, but not catastrophically so). As long as the size is large enough, I haven't seen evidence that 7.x is slower than 6.x.

Unfortunately, it's difficult to revert the culprit (it's actually a combination of two different changes) because some later work has come to depend on it, and per the discussion in #21378, we'd rather spend time rearchitecting the Merkle tree cache instead of addressing performance issues with the current implementation.

@joeljeske
Contributor Author

This makes complete sense, and now I understand why 6.x was sufficient, even if less optimal than it could have been. Thank you for explaining.

I will run some experiments on my end on 6.x and 7.x to evaluate the effectiveness of --experimental_remote_merkle_tree_cache when the cache size is adequate. Hopefully I won't end up objecting to your proposal to remove the cache altogether 😄

@oquenchil added the P3, type: support / not a bug (process), and awaiting-user-response labels and removed the type: bug and untriaged labels Apr 2, 2024
@tjgq
Contributor

tjgq commented Apr 2, 2024

@joeljeske I'm curious if you've had a chance to test out build performance with and without the Merkle tree cache? Knowing whether it helps in your case would be valuable input into our future plans.

@joeljeske
Contributor Author

Hey @tjgq thanks for reaching out. I have done some spot checks with the following flow: after a bazel clean I perform a build with remote execution where all remote actions are expected to be cached.
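Concretely, the flow is roughly the following (targets and remote-execution flags redacted; this is a sketch of the invocation, not the exact command):

bazel clean
bazel test //... --remote_download_minimal --jobs=200   # all remote actions expected to be cache hits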

Do you know if this flow is sufficient to evaluate the effectiveness of flipping --experimental_remote_merkle_tree_cache, or do I need to ensure actions are not cached remotely?

@tjgq
Contributor

tjgq commented Apr 2, 2024

It's fine if the actions are cached remotely, since in order to check for a cache hit we still need to construct the Merkle trees.

@joeljeske
Contributor Author

Excellent. So far, I have not observed a significant performance regression with this flag turned off. I will try flipping it for a portion of my fleet and watch for any regression.

With this off, should I be setting --experimental_remote_discard_merkle_trees?

@tjgq
Contributor

tjgq commented Apr 2, 2024

--experimental_remote_discard_merkle_trees is a separate optimization; I believe that one is generally a good tradeoff.

@joeljeske
Contributor Author

My performance issues in 7.x are resolved with --noexperimental_remote_merkle_tree_cache.

Flipping on --experimental_remote_discard_merkle_trees resulted in excessive memory pressure, which was unexpected, so I have it off currently.

I will close this now. I eagerly await any future optimizations to the Merkle tree calculation process that may come out of #21378 😄
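For anyone landing on this later, the relevant part of my configuration now looks roughly like this (a sketch, not the complete .bazelrc):

build --noexperimental_remote_merkle_tree_cache
# --experimental_remote_discard_merkle_trees left off for now due to the memory pressure mentioned above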
