
Add an HTTP-only remote cache, similar to v1 #11149

Open
stuhood opened this issue Nov 12, 2020 · 5 comments

@stuhood
Member

stuhood commented Nov 12, 2020

No description provided.

@Eric-Arellano
Contributor

Is this technically feasible, given the nature of v2 caching and the Remote Execution API?

@stuhood
Member Author

stuhood commented Jul 26, 2021

It is possible. But this cache would not be compatible with remote execution, which would make it very much a split focus.

I think that if someone wanted to implement this, we would be more than happy to accept the patch. But I don't think that any of the current maintainers are likely to implement it themselves, since we're fairly committed to the REAPI.

@huonw
Contributor

huonw commented May 12, 2022

If this is referring to something like GET/PUT requests to a cloud blob store, we'd appreciate it: 'running' an S3 bucket is easy and has a limited blast radius, while a custom server brings extra risk and likely an ongoing tail of devops work. I understand that this isn't a priority, but I thought it might still be useful to provide an additional perspective; if I had pointers about where to start, I'd potentially consider attempting an implementation.

In particular, as soon as we have to run a server as a critical piece of our software supply chain/build infrastructure, it's something we have to administer, monitor, keep up to date (as well as potentially juggle and secure additional sets of auth credentials for) and, more fundamentally, trust. An S3 bucket (or equivalent in GCS or B2 or ...), by contrast, doesn't require us to run any potentially-exploitable custom code, and a team will typically already be managing AWS credentials (or GCP or ...) securely for access to other services.

I note that various other similar tools have 'simple' remote caching (with various subsets of particular cloud providers supported) in addition to varying degrees of remote execution:

| tool | docs | blob stores supported | remote execution? |
| --- | --- | --- | --- |
| Bazel | https://bazel.build/docs/remote-caching#cloud-storage | GCS | yes |
| Rush | https://rushjs.io/pages/maintainer/build_cache/#enabling-cloud-storage | Azure, S3 | via BuildXL |
| Lage | https://microsoft.github.io/lage/docs/Guide/remote-cache/ | Azure | via BuildXL |
| Gradle | https://docs.gradle.org/current/userguide/build_cache.html | plugins: S3, GCS, ... | partially? |
| Nx* | https://nx.app/docs/distributed-caching | Nx Cloud | yes |
| Turborepo* | https://turborepo.org/docs/core-concepts/remote-caching | Vercel | no? |

(Nx and Turborepo seem to only support connecting to a service associated with the tool, but this has a DX/devops/trust profile closer to generic cloud blob storage than running a REAPI server manually, although not quite the same.)

stuhood pushed a commit that referenced this issue Jan 18, 2023
This is a no-functionality-change refactoring of `store::remote::ByteStore::load_bytes_with` that's arguably cleaner and also a step towards #11149. In particular:

1. that method doesn't need to take a closure any more, and thus is refactored to just be the "simplest": `load_bytes(...) -> Result<Option<Bytes>, String>`
2. that method previously didn't retry, and thus users had to do the retries themselves: this moves the retries to be fully within the `load_bytes` method itself, which is both easier to use, and keeps implementation details like gRPC (previously exposed as the `ByteStoreError::Grpc`/`tonic::Status` error variant) entirely contained to `store::remote::ByteStore`
3. to emphasise that last point, the `ByteStoreError` enum can thus become private, because it's an implementation detail of `store::remote::ByteStore`, no longer exposed in the public API

Step 1 resolves (and removes) a TODO comment. That TODO references #17065, but this patch _doesn't_ fix that issue.
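As an illustrative sketch (with simplified stand-in types, not the real Pants definitions), the shape that `load_bytes` ends up with is roughly:

```rust
use bytes::Bytes;

// Simplified stand-ins for the real Pants types, just to show the shape.
#[derive(Clone, Copy)]
pub struct Digest; // the real type carries a fingerprint and a byte length

pub struct ByteStore;

impl ByteStore {
    /// After the refactoring, retries and gRPC error handling live entirely
    /// inside this method, so callers only see `Result<Option<Bytes>, String>`:
    /// `Ok(None)` means the blob isn't present remotely.
    pub async fn load_bytes(&self, _digest: Digest) -> Result<Option<Bytes>, String> {
        // A real implementation loops with backoff around the underlying gRPC
        // read, mapping transport errors to `String`.
        Ok(None)
    }
}
```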
@huonw huonw self-assigned this Feb 13, 2023
huonw added a commit that referenced this issue May 23, 2023
This separates the `store::remote::ByteStore` coordination code from the
gRPC REAPI implementation code, by:

- adding a `store::remote::ByteStoreProvider` trait
- moving the REAPI implementation into `store::remote::reapi` and
implementing that trait

This is, in theory, just a lift-and-shift, with no functionality change,
other than maybe changing some logging and when exactly a work-unit is
created.

This is partial preparation work for supporting more remote stores like
GHA cache or S3, for #11149, and is specifically broken out of #17840.
Additional work required to actually solve #11149:

- generalising the action cache too
- implementing other byte store providers
- dynamically choosing the right provider for `store::remote::ByteStore`

The commits are individually reviewable:

1. preparatory clean-up
2. define the trait
3. move the REAPI code and implement the trait, as close to naively as
possible:
  - https://gist.github.com/huonw/475e4f0c281fe39e3d0c2e7e492cf823 has a
white-space-ignoring diff between `remote.rs` after 1, and the
`reapi.rs` file in this commit
  - there are some changes: making functions non-generic, and eliminating
the work-units etc. (all the code that was left in `remote.rs`)
4. minor clean-up
5. more major clean-up: inline the method calls into the trait
definition, so that they're not just `fn method(&self, ...) {
self.real_method(...) }`:
  - as with 3, this is just a move, without changing the details of the
code...
  - ...except for some minor changes like converting `?` into `.map(|e|
e.to_string())?` on the `get_capabilities` call in `store_bytes`
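As a rough, illustrative sketch of the split (the real trait lives in `store::remote` and its exact methods and signatures differ):

```rust
use async_trait::async_trait;
use bytes::Bytes;

// Simplified stand-in for the real digest type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Digest;

// Illustrative shape only: the coordination code in `store::remote::ByteStore`
// talks to this trait, and the gRPC REAPI code becomes just one implementation
// of it, living in `store::remote::reapi`.
#[async_trait]
pub trait ByteStoreProvider: Send + Sync + 'static {
    async fn store_bytes(&self, digest: Digest, bytes: Bytes) -> Result<(), String>;
    async fn load_bytes(&self, digest: Digest) -> Result<Option<Bytes>, String>;
}

// e.g. in `reapi.rs`:
// pub struct Provider { /* gRPC channel, instance name, retry config, ... */ }
// #[async_trait]
// impl ByteStoreProvider for Provider { ... }
```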
huonw added a commit that referenced this issue Jun 1, 2023
This separates the `remote::remote_cache` coordination code from the
gRPC REAPI implementation by:

- adding a `remote::remote_cache::ActionCacheProvider` trait
- moving the REAPI implementation into `remote::remote_cache::reapi` and
implementing that trait

This is, in theory, just a lift-and-shift, with no functionality change.

This is preparation work for supporting more remote stores like GHA
cache or S3, for #11149, and
is specifically broken out of
#17840. It continues #19050.
Additional work required to actually solve
#11149:

- implementing other byte store and action cache providers
- dynamically choosing the right providers for
`store::remote::ByteStore` and `remote::remote_cache`

The commits are individually reviewable:
1. preparatory breaking out of gRPC code
2. defining the trait
3. move the REAPI code and implement the trait, as close to naively as
possible:
  - https://gist.github.com/huonw/a60ad807b05ecea98387294c22de67cb has a
    white-space-ignoring diff between `remote_cache.rs` after 1, and the
    `reapi.rs` file in this commit (it's less useful than #19050, since most
    of the code is deleted, but, buried in there are chunks of completely
    unchanged code)
4. minor clean-up
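The action-cache side of the split has the same flavour; a heavily simplified, illustrative sketch (the real signatures also deal with REAPI protos, build IDs, and so on):

```rust
use async_trait::async_trait;

// Simplified stand-ins; the real types are REAPI protos and Pants digests.
#[derive(Clone, Copy)]
pub struct Digest;
#[derive(Clone)]
pub struct ActionResult;

// Illustrative shape only: `remote::remote_cache` coordinates, and the gRPC
// REAPI action cache becomes just one implementation of this trait.
#[async_trait]
pub trait ActionCacheProvider: Send + Sync + 'static {
    async fn get_action_result(&self, action_digest: Digest)
        -> Result<Option<ActionResult>, String>;
    async fn update_action_result(&self, action_digest: Digest, result: ActionResult)
        -> Result<(), String>;
}
```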
huonw added a commit that referenced this issue Jul 12, 2023
This does preparatory refactoring towards #11149 (and probably also
#19049), by adjusting `store::remote::ByteStore` in a few ways to make
it easier to plop in new 'providers':

- package the various options for creating a provider into a struct, and
a bunch of mechanical refactoring to use that struct
- improved tests:
  - explicit tests for the REAPI provider
  - more extensive exercising of the `ByteStore` interface, including
extra tests for combinations like "`load_file` when digest is missing",
and (to me) more logical test names
  - testing the 'normal' `ByteStore` interface via a simple, fully in-memory
`Provider` instead of needing to run the `StubCAS`
  - testing some more things at lower levels, like the REAPI provider
doing hash verification, the `LoadDestination::reset` methods doing the
right thing and `ByteStore::store_buffered`

The commits are mostly individually reviewable, although I'd recommend
having a sense of the big picture as described above before going
through them one-by-one.

After this, the next steps towards #11149 will be:

1. do something similar for the action cache
2. implement new providers, with some sort of factory function for going
from `RemoteOptions` to an appropriate `Arc<dyn ByteStoreProvider +
'static>`
3. expose settings to select and configure those new providers
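For step 2, the factory function could look roughly like this (names and fields here are hypothetical, not the actual Pants code):

```rust
use std::sync::Arc;

// Minimal stand-ins, purely illustrative.
pub trait ByteStoreProvider: Send + Sync {}
struct ReapiProvider;
impl ByteStoreProvider for ReapiProvider {}

// Hypothetical shape of the packaged-up options struct described above.
pub struct RemoteOptions {
    pub store_address: String,
    // plus headers, timeouts, chunk size, concurrency limits, ...
}

// Pick a provider implementation based on the configured address.
pub fn choose_byte_store_provider(
    options: &RemoteOptions,
) -> Result<Arc<dyn ByteStoreProvider + 'static>, String> {
    if options.store_address.starts_with("grpc://")
        || options.store_address.starts_with("grpcs://")
    {
        Ok(Arc::new(ReapiProvider))
    } else {
        Err(format!(
            "unsupported remote store address: {}",
            options.store_address
        ))
    }
}
```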
huonw added a commit that referenced this issue Jul 19, 2023
This does preparatory refactoring towards
#11149, by adjusting
`remote::remote_cache::CommandRunner` in a few ways to make it easier to
plop in new 'providers':

- package the various options for creating a provider/command runner
into structs, and a bunch of mechanical refactoring to use those structs
- explicit tests for the REAPI provider

This continues #19424, but, unlike that one, doesn't refactor
`remote_cache_tests.rs` to (mostly) use in-memory providers, as those
tests are significantly more complicated, with many more services than
just the get/update caching provider, and I don't think it strictly
blocks #11149.

After this, the next steps towards #11149 will be:

1. implement new providers, with some sort of factory function for
constructing the appropriate provider
2. expose settings in `pants.toml` to select and configure those new
providers
@ericlacher

@huonw I totally agree with your statement above and just realized you already started on preparatory changes. Just wanted to share my appreciation for the efforts!

@MichalOleszak FYI

huonw added a commit that referenced this issue Sep 12, 2023
…9711)

This (hopefully) optimises storing large blobs to a remote cache, by
streaming them directly from the file stored on disk in the "FSDB".

This builds on the FSDB local store work (#18153), relying on large
objects being stored as an immutable file on disk, in the cache managed
by Pants.

This is an optimisation in several ways:

- Cutting out an extra temporary file:
  - Previously `Store::store_large_blob_remote` would load the whole blob
from the local store and then write that to a temporary file. This was
appropriate with LMDB-backed blobs.
  - With the new FSDB, there's already a file that can be used, no need for
that temporary, and so the file creation and writing overhead can be
eliminated.
- Reducing sync IO in async tasks, due to mmap:
  - Previously `ByteStore::store_buffered` would take that temporary file
and mmap it, to be able to slice into `Bytes` more efficiently... except
this is secretly blocking/sync IO, happening within async tasks (AIUI:
when accessing a mmap'd byte that's only on disk, not yet in memory, the
whole OS thread is blocked/descheduled while the OS pulls the relevant
part of the file into memory, i.e. `tokio` can't run another task on
that thread).
  - This new approach uses normal `tokio` async IO mechanisms to read the
file, and thus hopefully has higher concurrency.
  - (This also eliminates the unmaintained `memmap` dependency.)

I haven't benchmarked this though.

My main motivation for this is firming up the provider API before adding
new byte store providers, for #11149. This also resolves some TODOs and
even eliminates some `unsafe`, yay!

The commits are individually reviewable.

Fixes #19049, fixes #14341 (`memmap` removed), closes #17234 (solves the
same problem but with an approach that wasn't possible at the time).
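The core idea, as an illustrative sketch rather than the actual Pants code: read the already-on-disk blob in chunks with `tokio`'s async file IO instead of mmapping it.

```rust
use std::path::Path;
use tokio::{fs::File, io::AsyncReadExt};

// Illustrative only: stream a large blob from its immutable FSDB file in
// fixed-size chunks. Awaiting these reads yields the thread back to tokio,
// instead of blocking it on page faults the way touching a mmap'd slice can.
async fn stream_blob_from_disk(path: &Path) -> Result<(), String> {
    let mut file = File::open(path).await.map_err(|e| e.to_string())?;
    let mut chunk = vec![0u8; 4 * 1024 * 1024]; // hypothetical chunk size
    loop {
        let n = file.read(&mut chunk).await.map_err(|e| e.to_string())?;
        if n == 0 {
            break; // end of file
        }
        // ... hand `&chunk[..n]` to the remote-store upload here ...
    }
    Ok(())
}
```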
huonw added a commit that referenced this issue Sep 21, 2023
…9827)

This is the final major step that unlocks #11149: 

- adding new byte store and action cache providers, implemented using
https://opendal.apache.org
- plumbing through to be able to use them via `remote_store_address =
"..."`
- (and implementing an "example" one, that's backed by the local file
system)

This doesn't yet resolve that issue, as it doesn't implement any sort of
useful "simple" truly-remote cache, but it sets up most of the
infrastructure.

### User facing change

This only adds support for one additional store for now:
`experimental:file:///path/to/dir`, which writes the store to
`/path/to/dir`.

This store is enough to validate that this functionality (mostly)
works, and even seems like it would be useful for:
- debugging/testing remote cache behaviour (e.g. can set up reduced
reproducers for remote caching bugs using normal file manipulation
commands)
- potentially for real, for using an NFS mount as a remote cache,
without forcing _all_ of Pants' caches to have the NFS overhead.

The user facing functionality here is configured entirely through
`remote_store_address`: that option now supports URIs that have schemes
other than `grpc` or `grpcs`. To start with, just `file`. It also
explicitly supports these schemes being experimental, communicated
through the `experimental:` scheme-prefix. This is to, hopefully, allow
landing the feature earlier and with less risk.

For example, running `pants test ::` in the following example writes
various blobs to subdirectories of `/tmp/remote-cache-example` (e.g.
`action-cache/e3/2f/e32f034f...` and `byte-store/50/ce/50ce4a68...`),
and a second `pants --no-pantsd --no-local-cache test ::` command is
able to service it from the cache, reporting `//:slow-test succeeded in
5.01s (cached remotely)`.

```toml
# pants.toml
[GLOBAL]
pants_version = "2.17.0"

backend_packages = ["pants.backend.shell"]

remote_store_address = "experimental:file:///tmp/remote-cache-example"
remote_cache_read = true
remote_cache_write = true
```

```python
# BUILD
experimental_test_shell_command(name="slow-test", command="sleep 5", tools=["sleep"])
```

### Details of this change

There are three main bits of work:

1. add the new generic OpenDAL byte store and action cache provider (in
the two pairs of `base_opendal.rs` + tests files)
2. hook up the specific FS one (in `store/src/remote.rs` and
`remote/src/remote_cache.rs`)
3. adjust the options parsing to support them, especially the
`experimental:` scheme handling (in `global_options.py`)

None of these are huge in isolation, but they add up to a fair chunk of
code. I think each of those files can be reviewed somewhat in isolation,
in that order.

### Why OpenDAL?

It's used by sccache for similar remote-caching purposes
(https://github.com/search?q=repo%3Amozilla%2Fsccache%20opendal&type=code),
and supports a whole lot of services:

- blob stores (e.g. S3, GCS)
- key-value caches (e.g. Redis)
- CI caches (e.g. GitHub Actions)
- weirder ones (IPFS, Google Drive)

Pants doesn't/shouldn't support all of them, but definitely blob stores
and CI caches seem exciting!
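For a sense of the API, a minimal sketch of the sort of thing the OpenDAL-backed providers do, using the local-filesystem service from the `experimental:file://` example (exact builder methods vary between OpenDAL versions, and the real providers layer timeouts, concurrency limits, etc. on top):

```rust
use opendal::{services, Operator};

async fn demo() -> opendal::Result<()> {
    // Back the operator with the local filesystem, as `experimental:file://` does.
    let mut builder = services::Fs::default();
    builder.root("/tmp/remote-cache-example/byte-store");
    let op = Operator::new(builder)?.finish();

    // Blobs are keyed by paths derived from their digest, e.g. `50/ce/50ce4a68...`.
    op.write("50/ce/50ce4a68-example-digest", vec![1, 2, 3]).await?;
    let blob = op.read("50/ce/50ce4a68-example-digest").await?;
    assert_eq!(blob.len(), 3);
    Ok(())
}
```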

### Next steps

This only adds support for caching "remotely" to the local file system,
but I'll add support for GitHub Actions as follow-up work (#19831), and
maybe S3 too. Hopefully as part of doing that I'll work through any
authentication (etc.) issues and so it becomes easier for others to add
the backends they need too.

I've suggested we use the URI scheme for deciding on the service, rather
than an explicit `remote_store_service = "reapi"` option, or similar. In
future, some services may use the same scheme, e.g. I imagine several
services might conventionally use `http://` or `https://`. My current
thinking is to solve this similar to pip's [VCS
requirements](https://pip.pypa.io/en/stable/topics/vcs-support/)
(`vcs+transport://...`) or `sqlalchemy` [database
URLs](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls)
(`dialect+driver://...`), e.g. `remote_store_address = "s3+http://..."`
or `remote_store_address = "http+s3://..."`, and then the provider would
strip off the `s3` part as appropriate.
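To make that concrete, a tiny hypothetical helper (not the actual Pants option handling) showing how a `<provider>+<transport>://` address could be split:

```rust
// Hypothetical helper: split "s3+http://host/path" into a provider tag ("s3")
// and the underlying address ("http://host/path"). Addresses without a "+" in
// the scheme are returned unchanged.
fn split_provider(address: &str) -> (Option<&str>, String) {
    match address.split_once("://") {
        Some((scheme, rest)) => match scheme.split_once('+') {
            Some((provider, transport)) => (Some(provider), format!("{transport}://{rest}")),
            None => (None, address.to_string()),
        },
        None => (None, address.to_string()),
    }
}

fn main() {
    assert_eq!(
        split_provider("s3+http://bucket/prefix"),
        (Some("s3"), "http://bucket/prefix".to_string())
    );
    assert_eq!(
        split_provider("grpcs://cache.example.com"),
        (None, "grpcs://cache.example.com".to_string())
    );
}
```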

There's also a whole lot of TODOs that I don't think are critical to
landing this code, and that I can chip away at once it's in: by landing this
sooner, I'm hoping we can start having the user-facing parts of #11149
being tested, validated and enjoyed sooner rather than later. I've
already done a loooot of *pre*-factoring (see all the PRs referencing
#11149), so it'd be nice to switch to having the feature in and just do
some normal *re*-factoring. 😅

TODOs mostly tracked in #19902
@Xuanwo
Contributor

Xuanwo commented Sep 25, 2023

> I note that various other similar tools have 'simple' remote caching (with various subsets of particular cloud providers supported) in addition to varying degrees of remote execution:

Nice post! Here are some notes from the OpenDAL side that may be worth considering as input for you.

- Bazel: Bazel's native remote caching is similar to a subset of WebDAV (PUT/GET/DELETE), which could be accessed via the OpenDAL WebDAV service.
- Lage: the link is broken, should be this, and it's azblob only.
- Gradle: Gradle's native remote caching is also WebDAV-compatible (as described in the sccache docs).
- Nx: I didn't find the HTTP API for its cache services, but we would be willing to integrate with OpenDAL if possible.
- Turborepo: Turborepo's remote cache support is powered by Vercel artifacts, which is already supported by the OpenDAL VercelArtifacts service.

Apart from all those services' own native cache services, storage services like s3, azblob, gcs are all already supported by OpenDAL. Please let us know if you want more 😆

huonw added a commit that referenced this issue Sep 27, 2023
This adds support for remote caching to the GitHub Actions Cache, by
setting `remote_store_address =
"experimental:github-actions-cache+https://..."`. This is implemented as
just another OpenDAL (#19827) provider.

As written, this requires the `ACTIONS_CACHE_URL` and
`ACTIONS_RUNTIME_TOKEN` values to be massaged into the appropriate pants
configuration. I've included basic docs for how to do this. This set-up
can/should be packaged into the `init-pants` action in future, but for
now, we're in experimental-land.

There's various ways in which this can be improved, but I think it's
helpful to have a "real world" example of a remote cache beyond the
`file://` one to be able to guide the refactoring, plus good to start
testing this earlier. Thus, I'd like to land this, and do them as
follow-up:

- better docs/help (always!), particularly on the `using-pants-in-ci`
docs page, which I haven't updated here
- now that I've implemented this, I don't like my
`<provider>+<scheme>://...` proposal from #19827. I think it'd be better
to have a separate `remote_store_provider =
"experimental-github-actions-cache"` (or `"reapi"`, etc.) option and
`remote_store_address` is just the raw URL (with the provider validating
it can understand the URL, of course)
- it'd be good to be able to pass the token in a simpler way; I'm
thinking there could literally be a `remote_oauth_bearer_token` option,
which is expected (and suggested in docs) to be passed as the
`PANTS_REMOTE_OAUTH_BEARER_TOKEN` env var, in addition to
`remote_oauth_bearer_token_path`

These are tracked in #19902 (and all of the others in that issue would
be good too, but weren't as acutely surfaced by this change, I'll start
chipping away at it once this is in).

This is awkward to test. My plan, not implemented here, is to somehow
tee up a dedicated set of tests to run within
https://github.com/nektos/act. For now, we'll have to go with
manual testing, e.g. this specific PR is validated by
https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6298908094/job/17119702612#step:11:983

This is the first example of something useful user-facing for #11149.
This doesn't close it, I'll wait until we have a blob store like S3 too.

The commits are individually reviewable.

(Replaces #17840.)