Build failures and internal errors when switching from 7.0.0rc3 to 7.0.0rc4 #20246

Closed
rsalvador opened this issue Nov 17, 2023 · 12 comments
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

@rsalvador
Contributor

Description of the bug:

Bazel 7.0.0rc3 built our big monorepo without problems, but 7.0.0rc4 fails with errors, e.g.:

WARNING: /Users/rsalvador/.../BUILD.bazel:3:17: output 'ui-services-connection-models-api/modelbuilder_generated' of //ui-services-connection-models-api:ui_services_connection_model is a directory; dependency checking of directories is unsound
ERROR: /Users/rsalvador/.../BUILD.bazel:3:17: modelbuilder generating srcjar modelbuilder_generated.srcjar failed: Exec failed due to IOException: /Users/rsalvador/.cache/bazel/b6de016d545fc335c6fe85486d422c54/execroot/core/bazel-out/darwin_x86_64-fastbuild/bin/ui-services-connection-models-api/modelbuilder_generated (No such file or directory)

and

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData6{actionLookupKey=ConfiguredTargetKey{label=//tools/build/bazel/rules/lwc/js:bundle, config=BuildConfigurationKey[6aa701ced8cd3692bfda0c63eeb4de6bb53c8649464bf1688db8c6f5976ed915]}, actionIndex=6}' (requested by nodes 'ArtifactNestedSetKey[6]@774500339', 'ArtifactNestedSetKey[5]@1662963755')
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:550)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.UnsupportedOperationException: Cannot get node id for RegularFileArtifactValue{digest=e58e4f6d4f57ab1e4a503a523d1c4e34ed03eee292a0237a898ac158c1b7bc69, size=1080, proxy=ctime of 1700240624111 and mtime of 499162500000 and nodeId of 2582129255}
	at com.google.devtools.build.lib.remote.RemoteActionFileSystem$2.getNodeId(RemoteActionFileSystem.java:634)
	at com.google.devtools.build.lib.vfs.DigestUtils$CacheKey.<init>(DigestUtils.java:67)
	at com.google.devtools.build.lib.vfs.DigestUtils.manuallyComputeDigest(DigestUtils.java:193)
	at com.google.devtools.build.lib.vfs.DigestUtils.getDigestWithManualFallback(DigestUtils.java:160)
	at com.google.devtools.build.lib.exec.SpawnLogContext.computeDigest(SpawnLogContext.java:348)
	at com.google.devtools.build.lib.exec.SpawnLogContext.listDirectoryContents(SpawnLogContext.java:296)
	at com.google.devtools.build.lib.exec.SpawnLogContext.logSpawn(SpawnLogContext.java:130)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:192)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:261)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1148)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1065)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:562)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:859)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:333)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:171)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)

We can probably fix the first error in the generator, but the internal error may point to a regression.

The internal error goes away if we don't use the --execution_log_json_file and --noexecution_log_sort flags.

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

We can't provide a minimal example; it happens deep into the build of a very big monorepo.

Which operating system are you running Bazel on?

macOS 14.1

What is the output of bazel info release?

release 7.0.0rc4

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

Yes, it is a regression.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@keertk
Member

keertk commented Nov 17, 2023

cc @Wyverald @meteorcloudy

@Wyverald
Member

Looking at the list of commits between rc3 and rc4, it does seem there were quite a few execution-log-related commits (for example c767aa4). cc @tjgq

To pinpoint where the problem is, @rsalvador, it would help if you ran a bazelisk bisect in your project or, alternatively, put together a minimal repro. Either would help us address this as quickly as possible.

@rsalvador
Contributor Author

There may be a problem with my machine/environment: the first build problem is now also happening with rc3.
Please don't take this issue into account for the release. I'll add more info once I can reproduce it reliably and find a commit with bisect. Sorry about the confusion.

iancha1992 added the team-Core (Skyframe, bazel query, BEP, options parsing, bazelrc) label on Nov 17, 2023
keertk added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team) label and removed the team-Core (Skyframe, bazel query, BEP, options parsing, bazelrc) label on Nov 18, 2023
@keertk
Member

keertk commented Nov 18, 2023

@bazel-io fork 7.0.0

@rsalvador
Contributor Author

The first error:

ERROR: /Users/rsalvador/.../BUILD.bazel:3:17: modelbuilder generating srcjar modelbuilder_generated.srcjar failed: Exec failed due to IOException: /Users/rsalvador/.cache/bazel/b6de016d545fc335c6fe85486d422c54/execroot/core/bazel-out/darwin_x86_64-fastbuild/bin/ui-services-connection-models-api/modelbuilder_generated (No such file or directory)

happens because we have directory dependencies between actions. With --noincompatible_disallow_unsound_directory_outputs the project still builds in 7.0.0, but the directory dependencies between actions do not seem to be taken into account: actions that depend on a directory are sometimes executed before the action that generates it. We fixed this error by removing --noincompatible_disallow_unsound_directory_outputs and properly using ctx.actions.declare_directory(). This error was not specific to rc4.

Regarding the internal error:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData6{actionLookupKey=ConfiguredTargetKey{label=//tools/build/bazel/rules/lwc/js:bundle, config=BuildConfigurationKey[6aa701ced8cd3692bfda0c63eeb4de6bb53c8649464bf1688db8c6f5976ed915]}, actionIndex=6}' (requested by nodes 'ArtifactNestedSetKey[6]@774500339', 'ArtifactNestedSetKey[5]@1662963755')

bazelisk bisect found that it is due to this commit: c456082. The error occurs during builds that use --execution_log_json_file.

@davido
Contributor

davido commented Nov 19, 2023

@rsalvador We have found another regression in the Gerrit Code Review build machinery that was hard to reproduce due to Bazel caching. I had to wipe out the whole cache.

See this issue for more details.

@rsalvador
Contributor Author

The initial run of bazelisk --bisect failed to identify the correct commit, as the internal error occurs only when the --disk_cache option is used and the build action is sourced from the disk cache. The correct commit is: c767aa4.

@tjgq
Contributor

tjgq commented Nov 20, 2023

@rsalvador Thanks for the bisect! A couple of followup questions as I try to repro this:

  1. What kind of rule is //tools/build/bazel/rules/lwc/js:bundle? Is the rule implementation publicly available?
  2. If not, would you be able to provide the output of bazel aquery //tools/build/bazel/rules/lwc/js:bundle?

@rsalvador
Contributor Author

@tjgq It is this rule: https://github.com/aspect-build/rules_rollup/blob/main/rollup/defs.bzl
This is how we use it:

load("@aspect_rules_rollup//rollup:defs.bzl", "rollup")

package(default_visibility = ["//visibility:public"])

rollup(
    name = "bundle",
    srcs = glob(["src/**/*.js"]),
    config_file = "//tools/build/bazel/rules/lwc/js:rollup.config.js",
    entry_point = "index.js",
    format = "cjs",
    node_modules = "//:node_modules",
    deps = [
        "//:node_modules/@babel/core",
        "//:node_modules/@babel/preset-typescript",
        "//:node_modules/@lwc-platform/sfdc-lwc-compiler",
        "//:node_modules/@rollup/plugin-commonjs",
        "//:node_modules/@rollup/plugin-json",
        "//:node_modules/@rollup/plugin-node-resolve",
        "//:node_modules/minimist",
    ],
)

Let me know if you need the aquery output.

tjgq added a commit to tjgq/bazel that referenced this issue Nov 20, 2023
This is an attempt to fix bazelbuild#20246 purely from guesswork. Note the salient features of the stack trace in the bug report:

1. The crash occurs while attempting to obtain a digest for a file.
2. DigestUtils#getDigestWithManualFallback falls back to computing the digest manually, implying that RAFS#getFastDigest returned null.
3. RAFS#stat() produces a FileStatus with a missing getNodeId() implementation.

(3) implies that RAFS#statInMemory was successful, while (2) implies that it wasn't. One possibility is that the file in question is a symlink, so getFastDigest fails to retrieve the metadata for the symlink itself, while stat() follows the symlink and successfully returns the metadata for its target.

PiperOrigin-RevId: 583987445
Change-Id: I65e586ea84635a279208e24c421f54ae46ee21b8
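
The reasoning in this commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Bazel code: the class SymlinkDigestSketch, its methods, and the paths are all hypothetical stand-ins for the RemoteActionFileSystem behavior described above. It assumes a fast-digest lookup that does not follow symlinks and a stat that does, and shows how that combination could end in the UnsupportedOperationException from the stack trace.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the suspected bug. Not Bazel code: every name below is hypothetical. */
final class SymlinkDigestSketch {
  /** Metadata for a file that exists only in memory: it has a digest but no inode. */
  static final class InMemoryFile {
    final String digest;

    InMemoryFile(String digest) {
      this.digest = digest;
    }

    long nodeId() {
      throw new UnsupportedOperationException("Cannot get node id for in-memory file");
    }
  }

  private final Map<String, InMemoryFile> files = new HashMap<>();
  private final Map<String, String> symlinks = new HashMap<>();

  void addFile(String path, String digest) {
    files.put(path, new InMemoryFile(digest));
  }

  void addSymlink(String path, String target) {
    symlinks.put(path, target);
  }

  /** Looks up the path itself and does not follow symlinks, so a symlink yields empty. */
  Optional<String> getFastDigest(String path) {
    InMemoryFile file = files.get(path);
    return file == null ? Optional.empty() : Optional.of(file.digest);
  }

  /** Follows one level of symlink, so a symlink resolves to its target's metadata. */
  InMemoryFile stat(String path) {
    InMemoryFile file = files.get(symlinks.getOrDefault(path, path));
    if (file == null) {
      throw new IllegalStateException("No such file or directory: " + path);
    }
    return file;
  }

  /** Uses the fast digest when present; otherwise builds a cache key from the node id. */
  String getDigestWithManualFallback(String path) {
    Optional<String> fast = getFastDigest(path);
    if (fast.isPresent()) {
      return fast.get();
    }
    long nodeId = stat(path).nodeId(); // throws when the target is an in-memory file
    return "manually-computed-for-node-" + nodeId;
  }

  public static void main(String[] args) {
    SymlinkDigestSketch fs = new SymlinkDigestSketch();
    fs.addFile("out/real.js", "e58e4f6d");
    fs.addSymlink("out/link.js", "out/real.js");

    System.out.println(fs.getDigestWithManualFallback("out/real.js"));
    try {
      fs.getDigestWithManualFallback("out/link.js");
    } catch (UnsupportedOperationException e) {
      System.out.println("crash: " + e.getMessage());
    }
  }
}
```

In this model the regular file takes the fast path, while the symlink misses the fast path, succeeds in stat, and then fails when the fallback asks for a node id.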
@tjgq
Contributor

tjgq commented Nov 20, 2023

@rsalvador I have a tentative fix, but some guesswork is involved and I'm not sure it's the right one. Would it be possible for you to build a custom Bazel from https://github.com/tjgq/bazel/tree/execlog-digest-crash-fix (clone and check out that branch, then run bazel build //src:bazel) and check whether it works?

@rsalvador
Contributor Author

@tjgq that fixed it, thx!

@tjgq
Contributor

tjgq commented Nov 20, 2023

@rsalvador Thanks for confirming! I've sent the patch for internal review.

For future me: this can also be reproduced with rules_rollup by running USE_BAZEL_VERSION=7.0.0rc4 bazel build --disk_cache=disk --execution_log_binary_file=log --noexecution_log_sort //example:bundle, but one has to (1) comment out .bazeliskrc and (2) use an --override_module to s/experimental_allow_unresolved_symlinks/allow_unresolved_symlinks/ in the aspect_rules_js module.

iancha1992 changed the title from "[7.0.0rc4] Build failures and internal errors when switching from 7.0.0rc3 to 7.0.0rc4" to "Build failures and internal errors when switching from 7.0.0rc3 to 7.0.0rc4" on Nov 20, 2023
joeleba removed the untriaged label on Nov 21, 2023
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Nov 21, 2023
The methods are documented as such in FileSystem. If we don't do this, there will be a discrepancy between getFastDigest and stat, as the latter can follow symlinks. This can manifest as a crash (see bazelbuild#20246) as the digest computation will take the missing fast digest for a symlink as a signal to compute the digest manually; this would fail when the symlink target is an in-memory file, which doesn't have an associated inode as required to compute the cache key (see DigestUtils#manuallyComputeDigest).

Fixes bazelbuild#20246.

PiperOrigin-RevId: 584297990
Change-Id: I65e586ea84635a279208e24c421f54ae46ee21b8
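
One way to picture the discrepancy being closed is a digest lookup that resolves symlinks before consulting the in-memory metadata, so it sees the same file that stat would. The following is a minimal sketch under that assumption, not the actual patch: FollowSymlinkDigestSketch, resolve, and MAX_HOPS are hypothetical names, and the real change in RemoteActionFileSystem may differ in its details.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the fix idea. Not Bazel code: every name below is hypothetical. */
final class FollowSymlinkDigestSketch {
  private static final int MAX_HOPS = 32; // arbitrary guard against symlink cycles

  private final Map<String, String> digests = new HashMap<>(); // in-memory file -> digest
  private final Map<String, String> symlinks = new HashMap<>(); // symlink -> target

  void addFile(String path, String digest) {
    digests.put(path, digest);
  }

  void addSymlink(String path, String target) {
    symlinks.put(path, target);
  }

  /** Follows a chain of symlinks until it reaches a path that is not a symlink. */
  String resolve(String path) {
    String current = path;
    for (int hops = 0; symlinks.containsKey(current); hops++) {
      if (hops >= MAX_HOPS) {
        throw new IllegalStateException("Too many levels of symbolic links: " + path);
      }
      current = symlinks.get(current);
    }
    return current;
  }

  /** Resolves symlinks first, so a symlink to an in-memory file yields that file's digest. */
  Optional<String> getFastDigest(String path) {
    return Optional.ofNullable(digests.get(resolve(path)));
  }

  public static void main(String[] args) {
    FollowSymlinkDigestSketch fs = new FollowSymlinkDigestSketch();
    fs.addFile("out/real.js", "e58e4f6d");
    fs.addSymlink("out/link.js", "out/real.js");

    // Both lookups now find the digest, so no manual fallback is needed.
    System.out.println(fs.getFastDigest("out/real.js"));
    System.out.println(fs.getFastDigest("out/link.js"));
  }
}
```

With the fast digest available for the symlink, a caller in this model never reaches the manual computation that needs an inode.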
keertk pushed a commit that referenced this issue Nov 21, 2023
…c3 to 7.0.0rc4 (#20278)

The methods are documented as such in FileSystem. If we don't do this,
there will be a discrepancy between getFastDigest and stat, as the
latter can follow symlinks. This can manifest as a crash (see #20246) as
the digest computation will take the missing fast digest for a symlink
as a signal to compute the digest manually; this would fail when the
symlink target is an in-memory file, which doesn't have an associated
inode as required to compute the cache key (see
DigestUtils#manuallyComputeDigest).

Fixes #20246.

Commit
aab19f7

PiperOrigin-RevId: 584297990
Change-Id: I65e586ea84635a279208e24c421f54ae46ee21b8

Co-authored-by: Googler <[email protected]>