
[ci] Bazel cache not being used #20844

Open
jwnrt opened this issue Jan 16, 2024 · 10 comments
Labels
Component:CI Continuous Integration (Azure Pipelines & Co.) SW:Build System Type:Enhancement Feature requests, enhancements

Comments

@jwnrt
Contributor

jwnrt commented Jan 16, 2024

Description

We believe the Bazel cache that we store in a GCP bucket and use for CI is not working correctly.

The sw_build job is set up to use the cache: it runs its build through ci/bazelisk.sh (which points Bazel at the GCP bucket as its remote cache) and, on the master branch, loads the GCP write credentials so that it can also write to the cache. As far as we can tell, the sw_build job never gets hits from the cache and always rebuilds from scratch.
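For context, the read side and the write side of the remote cache are controlled by separate Bazel flags. A rough sketch of the difference (the flag values below are illustrative, not copied from ci/bazelisk.sh):

# Read-only use of the cache (what non-master runs effectively get):
build --remote_cache=https://storage.googleapis.com/opentitan-bazel-cache
build --remote_upload_local_results=false

# Writing additionally needs credentials for the bucket, e.g. a service account:
build --google_credentials=/path/to/service-account.json
build --remote_upload_local_results=true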

Reproducing locally

I tried to set up a local environment that replicates the sw_build job closely enough to read from the cache, but had no success.

These are the (convoluted) steps I used (thanks to @nbdd0121 for helping with this):

  1. Added the following to ~/.bazelrc (and ~root/.bazelrc for good measure) to replicate the configuration that ci/bazelisk.sh uses for its builds:
build --remote_cache=https://storage.googleapis.com/opentitan-bazel-cache
build --remote_upload_local_results=false
build --remote_default_exec_properties=OSVersion="Ubuntu 20.04.6 LTS"
  2. Ensured my local environment variables matched CI for those set in rules/nonhermetic.bzl:
    • Set HOME="/root" - this is what CI uses, and the HOME env var is non-hermetic, so it may have influenced the cache.
    • Set PATH=/tools/verilator/v4.210/bin:/tools/verible/bin:/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin which is what sw_build uses - we extracted this by running env in this PR's CI run.
    • $XILINXD_LICENSE_FILE already matched, and the other vars were unset.
  3. Executed the same build command that sw_build does, but using ./bazelisk.sh instead of ci/bazelisk.sh to ensure our own .bazelrc is used.

We did not observe the cache being used at all. There must be something else contributing to the cache that we didn't account for.
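One way to dig further (a sketch of an approach, not something we have wired into CI) would be to capture Bazel's execution log for the local build and for a CI build and diff them: the log records every action's inputs and their digests, so the first differing input usually points at the source of the cache-key mismatch. The target pattern below is a placeholder.

# Capture the execution log for a local build (placeholder target pattern).
./bazelisk.sh build //sw/... --execution_log_json_file=/tmp/exec_local.json

# Run the equivalent command in the other environment (e.g. the CI container),
# writing to /tmp/exec_ci.json, then compare the two logs:
diff /tmp/exec_local.json /tmp/exec_ci.json | head -n 100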

Related

Not every job in the CI workflow uses the cache: some aren't using ci/bazelisk.sh so won't read from it, and some aren't loading the GCP write credentials and won't write to it.

I've opened #20836 to address these issues, but they won't make much difference if the cache isn't working.

@jwnrt jwnrt added Type:Enhancement Feature requests, enhancements Component:CI Continuous Integration (Azure Pipelines & Co.) SW:Build System labels Jan 16, 2024
@jwnrt
Contributor Author

jwnrt commented Jan 16, 2024

I also think we need to add test --remote_cache=... lines to the bazelrc to get caching when running tests. I don't think build ... lines are used even when building tests.

@pamaury
Contributor

pamaury commented Jan 16, 2024

I did a small experiment: I created this PR that only builds a few SW artifacts (and none of them rely on the bitstreams). I modified the pipeline to save and upload the Bazel execution log. I then ran the pipeline twice on Azure to make sure that the two runs used the same code and merge commit:

The first difference appears here:

inputs {
  path: "external/python3_x86_64-unknown-linux-gnu/lib/python3.9/__pycache__/_markupbase.cpython-39.pyc"
  digest {
    hash: "6b7e9c33342668210679e6a15e4c4cd3e0b39c524bf21d006c1087baec24e84f"
    size_bytes: 7870
    hash_function_name: "SHA-256"
  }
}

In fact, many pyc objects have completely different hashes. I have not yet checked how this affects the cache, but I find it quite surprising that those files differ even though we supposedly manage Python through Bazel.

@pamaury
Contributor

pamaury commented Jan 16, 2024

I think this is the underlying issue: bazelbuild/rules_python#1761
By default, Python stores the compilation timestamp in the .pyc...

@pamaury
Contributor

pamaury commented Jan 16, 2024

ping @a-will @cfrantz @timothytrippel

@jwnrt
Contributor Author

jwnrt commented Jan 16, 2024

According to https://peps.python.org/pep-0552/, .pyc files can be made reproducible by using hash-based instead of timestamp-based invalidation; it seems to be a matter of enabling this in the tooling somehow. I'm not sure how / where this would happen in Bazel's Python toolchains.
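For reference, outside of Bazel the standard library already supports hash-based pycs; a minimal sketch (the path is a placeholder) would be either of the following:

# Compile with hash-based (PEP 552) invalidation instead of timestamps.
python3 -m compileall --invalidation-mode checked-hash path/to/site-packages

# Alternatively, setting SOURCE_DATE_EPOCH makes compileall/py_compile default
# to hash-based pycs (the value itself is arbitrary here).
SOURCE_DATE_EPOCH=315532800 python3 -m compileall path/to/site-packages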

@nbdd0121
Contributor

rules_python does contain provisions for this:

https://github.com/lowRISC/rules_python/blob/07c3f8547abbd5b97839a48af226a0fbcfaa5e7c/python/pip_install/extract_wheels/__init__.py#L39-L42

However, we are building the Python wheels ourselves.

@pamaury
Contributor

pamaury commented Jan 16, 2024

I have run another test after adding the reproducibility fixes that @nbdd0121 identified for Python. The runs are here and there. The Python issue is gone, but there are more differences:

  • Some differences in the execution log that I do not understand, but that do not seem to be about hashes, e.g.
runner: "processwrapper-sandbox"
remote_cacheable: true
target_label: "//sw/host/opentitantool:opentitantool"
9: "Compiling Rust bin opentitantool (25 files)"
17: {
  1: 3
  2: 318463000
}

What do those numbers stand for?

  • More worrying: the volatile-status.txt and stable-status.txt files seem to make their way into a number of actions, maybe the ones involving genrules, though I am not certain. Since the runners may have different usernames, the stable status will differ between runs; similarly, the volatile status contains a timestamp, which is bad (see the sketch of these files below). Surprisingly, the remote cache does not ignore those files, whereas the local cache seems to. Maybe we should look into avoiding stamping? There is also an experimental flag in Bazel 7.
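For context on why those files differ between runners, Bazel's default workspace status keys look roughly like this (the values below are made up):

# stable-status.txt (default keys)
BUILD_EMBED_LABEL
BUILD_HOST ci-runner-01
BUILD_USER vsts

# volatile-status.txt (default keys)
BUILD_TIMESTAMP 1705420800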

@pamaury
Contributor

pamaury commented Jan 17, 2024

I have identified that the problem with stamping is that we use the workspace status command in .bazelrc to include the timestamp and git revision. This is then used by //util:full_version_file, which in turn is used by header generation and so on.

As an experiment, in bazelbuild/bazel#20318, I have tried to override this workspace status command in CI to use constant values:
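For illustration only (a sketch, not the actual change in that PR; the script name and keys are placeholders), such an override could look like this:

# .bazelrc override, applied only in CI:
build --workspace_status_command=ci/constant-workspace-status.sh

# ci/constant-workspace-status.sh: print constant "KEY value" pairs instead of
# the real git revision. Keys prefixed with STABLE_ end up in stable-status.txt;
# everything else goes into volatile-status.txt.
echo "STABLE_GIT_VERSION ci-constant"
echo "STABLE_GIT_REVISION 0000000000000000000000000000000000000000"

Keeping the printed values constant keeps the stamped inputs of every affected action constant across runs.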

There are now no differences between the stamping files; however, we still have a few offenders to track down:

  • bazel-out/k8-fastbuild/bin/sw/device/lib/ujson/example_roundtrip.mod-id: apparently the mod ID check does not produce deterministic output; I need to investigate what is happening there. EDIT: this is because opentitantool outputs an error message containing a timestamp. Since this file is essentially a log and is never used anyway, this is probably not too critical.
  • bazel-out/k8-fastbuild-ST-2cc462681f62/bin/hw/ip/otp_ctrl/data/img_rma.24.vmem: the output changes between runs. This one is problematic as the result is used in a number of tests. It is generated by util/design/gen-otp-img.py, so at least we should be able to debug that. EDIT: the generation script adds a comment containing the time (// Generated on ...) to the vmem.
  • bazel-out/k8-fastbuild-ST-2cc462681f62/bin/sw/otbn/code-snippets/randomness.rv32embed.a: this one is weird. It is output together with bazel-out/k8-fastbuild-ST-2cc462681f62/bin/sw/otbn/code-snippets/randomness.elf, but while the latter is reproducible, this rv32embed.a is not. This file is used in some entropy test.

@pamaury
Contributor

pamaury commented Jan 17, 2024

I think we need to discuss and agree on how to handle Bazel stamping, since at the moment we are not using it consistently:

  • Some tools that currently stamp (like gen-otp-img.py) should probably not stamp: the generated vmem files are build artifacts for which stamping makes little sense.
  • Some tools that currently stamp (like reggen) have a dual usage: they generate files for builds but also files for export. When exporting, we definitely want to stamp, but in CI and for builds we probably do not.
  • We have some dependencies (like //util:full_version_file and others in //util) that specify stamp = 1, which means that any rule that depends on those files will indirectly require stamping.
  • opentitantool requires stamping but is again a dual-usage tool; maybe it should only stamp when required.

Bazel supports stamp = -1 (i.e. stamping is controlled by the --stamp/--nostamp command-line flag). We should look into whether supporting this configuration, with --nostamp in CI and --stamp for releases, is a viable option. This would require changes in several tools but would also clarify the situation.
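As a sketch of what that could look like (hypothetical config name; not a tested change), the .bazelrc would default to --nostamp and gate stamping behind a release config:

# Default for CI and developer builds: no stamping, so outputs stay cacheable.
build --nostamp
# Release builds opt back in: bazel build --config=release //...
build:release --stamp

With stamp = -1 on the affected rules, CI builds would then avoid embedding volatile values while release builds still carry the version information.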

Current users of stamping:

@jwnrt
Contributor Author

jwnrt commented Jan 18, 2024

Quoting my earlier comment:

    I also think we need to add test --remote_cache=... lines to the bazelrc to get caching when running tests. I don't think build ... lines are used even when building tests.

Correction: this isn't needed, as test commands inherit options from build. From the Bazel docs:

    command: Bazel command, such as build or query to which the options apply. These options also apply to all commands that inherit from the specified command. (For example, test inherits from build.)
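Concretely, the single line that ci/bazelisk.sh already injects should therefore cover both bazel build and bazel test invocations:

# Also applies to `bazel test`, since test inherits build options.
build --remote_cache=https://storage.googleapis.com/opentitan-bazel-cache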
