# Add an HTTP-only remote cache, similar to v1 #11149
Is this technically feasible, given the nature of v2 caching and the Remote Execution API?
It is possible. But this cache would not be compatible with remote execution, which would make it very much a split focus. I think that if someone wanted to implement this, we would be more than happy to accept the patch. But I don't think that any of the current maintainers are likely to implement it themselves, since we're fairly committed to the REAPI.
If this is referring to something like [a self-hosted cache server], that doesn't quite address our concern. In particular, as soon as we're having to run a server as a critical piece of our software supply chain/build infrastructure, it's something we have to administer, monitor and keep up to date (as well as potentially needing to juggle and secure additional sets of auth credentials) and, more fundamentally, trust. Whereas an S3 bucket (or equivalent in GCS or B2 or ...) doesn't require us to run any potentially-exploitable custom code, and typically a team will already be managing AWS credentials (or GCP or ...) securely for access to other services.

I note that various other similar tools have 'simple' remote caching (with various subsets of particular cloud providers supported) in addition to varying degrees of remote execution:
(Nx and Turborepo seem to only support connecting to a service associated with the tool, but this has a DX/devops/trust profile closer to that of generic cloud blob storage than running a REAPI server manually, although not quite the same.)
This is a no-functionality-change refactoring of `store::remote::ByteStore::load_bytes_with` that's arguably cleaner and is also a step towards #11149. In particular:

1. that method doesn't need to take a closure any more, and thus is refactored to just be the "simplest" `load_bytes(...) -> Result<Option<Bytes>, String>`
2. that method previously didn't retry, so users had to do the retries themselves: this moves the retries fully within the `load_bytes` method itself, which is both easier to use and keeps implementation details like gRPC (previously exposed as the `ByteStoreError::Grpc`/`tonic::Status` error variant) entirely contained to `store::remote::ByteStore`
3. to emphasise that last point, the `ByteStoreError` enum can thus become private, because it's an implementation detail of `store::remote::ByteStore`, no longer exposed in the public API

Step 1 resolves (and removes) a TODO comment. That TODO references #17065, but this patch _doesn't_ fix that issue.
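For illustration, a minimal sketch of the resulting shape (the names and retry policy here are my assumptions, not the exact code that landed): retries happen inside `load_bytes`, and errors cross the API boundary only as `String`s.

```rust
use bytes::Bytes;

/// Placeholder standing in for pants' real Digest type.
#[derive(Clone)]
pub struct Digest {
    pub hash: String,
    pub size_bytes: usize,
}

pub struct ByteStore {/* gRPC client etc. elided */}

impl ByteStore {
    /// Load the bytes for `digest`, retrying internally; `Ok(None)` means
    /// the blob wasn't present in the remote store.
    pub async fn load_bytes(&self, digest: Digest) -> Result<Option<Bytes>, String> {
        const MAX_ATTEMPTS: usize = 3; // illustrative retry policy
        let mut last_error = String::from("unreachable: no attempts made");
        for _ in 0..MAX_ATTEMPTS {
            match self.load_bytes_once(&digest).await {
                Ok(result) => return Ok(result),
                // gRPC/tonic errors are stringified here, so they remain an
                // implementation detail of this module.
                Err(e) => last_error = e,
            }
        }
        Err(last_error)
    }

    async fn load_bytes_once(&self, _digest: &Digest) -> Result<Option<Bytes>, String> {
        unimplemented!("single gRPC ByteStream read elided")
    }
}
```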
This separates the `store::remote::ByteStore` coordination code from the gRPC REAPI implementation code, by:

- adding a `store::remote::ByteStoreProvider` trait (sketched below)
- moving the REAPI implementation into `store::remote::reapi` and implementing that trait

This is, in theory, just a lift-and-shift, with no functionality change, other than maybe changing some logging and when exactly a work-unit is created.

This is partial preparation work for supporting more remote stores like the GHA cache or S3, for #11149, and is specifically broken out of #17840. Additional work required to actually solve #11149:

- generalising the action cache too
- implementing other byte store providers
- dynamically choosing the right provider for `store::remote::ByteStore`

The commits are individually reviewable:

1. preparatory clean-up
2. define the trait
3. move the REAPI code and implement the trait, as close to naively as possible:
   - https://gist.github.com/huonw/475e4f0c281fe39e3d0c2e7e492cf823 has a white-space-ignoring diff between `remote.rs` after 1, and the `reapi.rs` file in this commit
   - there are some changes: making functions non-generic, and eliminating the work-units etc. (all the code that was left in `remote.rs`)
4. minor clean-up
5. more major clean-up: inline the method calls into the trait definition, so that they're not just `fn method(&self, ...) { self.real_method(...) }`:
   - as with 3, this is just a move, without changing the details of the code...
   - ...except for some minor changes like converting `?` into `.map(|e| e.to_string())?` on the `get_capabilities` call in `store_bytes`
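As a rough illustration of the shape such a provider trait can take (a sketch under my own assumptions; the method names and signatures are illustrative, not the exact trait in the pants source):

```rust
use async_trait::async_trait;
use bytes::Bytes;

/// Placeholder standing in for pants' real Digest type.
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct Digest {
    pub hash: String,
    pub size_bytes: usize,
}

/// A backend-agnostic byte store: the coordination code in
/// `store::remote::ByteStore` talks only to this trait, and the gRPC REAPI
/// implementation becomes just one implementor of it.
#[async_trait]
pub trait ByteStoreProvider: Send + Sync {
    /// Store `bytes` under `digest`.
    async fn store_bytes(&self, digest: Digest, bytes: Bytes) -> Result<(), String>;

    /// Load the blob for `digest`, or `None` if the store doesn't have it.
    async fn load_bytes(&self, digest: Digest) -> Result<Option<Bytes>, String>;
}
```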
This separates the `remote::remote_cache` coordination code from the gRPC REAPI implementation by:

- adding a `remote::remote_cache::ActionCacheProvider` trait (sketched below)
- moving the REAPI implementation into `remote::remote_cache::reapi` and implementing that trait

This is, in theory, just a lift-and-shift, with no functionality change.

This is preparation work for supporting more remote stores like the GHA cache or S3, for #11149, and is specifically broken out of #17840. It continues #19050. Additional work required to actually solve #11149:

- implementing other byte store and action cache providers
- dynamically choosing the right providers for `store::remote::ByteStore` and `remote::remote_cache`

The commits are individually reviewable:

1. preparatory breaking out of gRPC code
2. defining the trait
3. move the REAPI code and implement the trait, as close to naively as possible:
   - https://gist.github.com/huonw/a60ad807b05ecea98387294c22de67cb has a white-space-ignoring diff between `remote_cache.rs` after 1, and the `reapi.rs` file in this commit (it's less useful than #19050, since most of the code is deleted, but buried in there are chunks of completely unchanged code)
4. minor clean-up
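A companion sketch for the action cache side (again, my own illustrative names and signatures, not the exact trait that landed):

```rust
use async_trait::async_trait;

/// Placeholders standing in for pants' real Digest and REAPI ActionResult types.
#[derive(Clone)]
pub struct Digest {
    pub hash: String,
    pub size_bytes: usize,
}
pub struct ActionResult {/* exit code, output digests, etc. elided */}

/// A backend-agnostic action cache: maps the digest of an action to the
/// recorded result of having run it.
#[async_trait]
pub trait ActionCacheProvider: Send + Sync {
    /// Record `result` as the cached outcome for `action_digest`.
    async fn update_action_result(
        &self,
        action_digest: Digest,
        result: ActionResult,
    ) -> Result<(), String>;

    /// Look up a previously cached result, if any.
    async fn get_action_result(
        &self,
        action_digest: Digest,
    ) -> Result<Option<ActionResult>, String>;
}
```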
This does preparatory refactoring towards #11149 (and probably also #19049), by adjusting `store::remote::ByteStore` in a few ways to make it easier to plop in new 'providers':

- package the various options for creating a provider into a struct, and a bunch of mechanical refactoring to use that struct
- improved tests:
  - explicit tests for the REAPI provider
  - more extensive exercising of the `ByteStore` interface, including extra tests for combinations like "`load_file` when digest is missing", and (to me) more logical test names
  - testing the 'normal' `ByteStore` interface via a fully in-memory simple `Provider` (sketched below) instead of needing to run the `StubCAS`
  - testing some more things at lower levels, like the REAPI provider doing hash verification, the `LoadDestination::reset` methods doing the right thing, and `ByteStore::store_buffered`

The commits are mostly individually reviewable, although I'd recommend having a sense of the big picture as described above before going through them one-by-one.

After this, the next steps towards #11149 will be:

1. do something similar for the action cache
2. implement new providers, with some sort of factory function for going from `RemoteOptions` to an appropriate `Arc<dyn ByteStoreProvider + 'static>`
3. expose settings to select and configure those new providers
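For a flavour of what such an in-memory test provider can look like (my own sketch, reusing the hypothetical `ByteStoreProvider` trait and `Digest` placeholder from the earlier sketch; not the actual test code):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

use async_trait::async_trait;
use bytes::Bytes;

/// A trivial in-memory provider: enough to exercise ByteStore's
/// coordination logic in tests without spinning up a stub gRPC CAS.
#[derive(Default)]
pub struct MemoryProvider {
    blobs: Mutex<HashMap<String, Bytes>>,
}

#[async_trait]
impl ByteStoreProvider for MemoryProvider {
    async fn store_bytes(&self, digest: Digest, bytes: Bytes) -> Result<(), String> {
        self.blobs.lock().unwrap().insert(digest.hash, bytes);
        Ok(())
    }

    async fn load_bytes(&self, digest: Digest) -> Result<Option<Bytes>, String> {
        Ok(self.blobs.lock().unwrap().get(&digest.hash).cloned())
    }
}
```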
This does preparatory refactoring towards #11149, by adjusting `remote::remote_cache::CommandRunner` in a few ways to make it easier to plop in new 'providers':

- package the various options for creating a provider/command runner into structs (see the sketch below), and a bunch of mechanical refactoring to use those structs
- explicit tests for the REAPI provider

This continues #19424 but, unlike that one, doesn't refactor `remote_cache_tests.rs` to (mostly) use in-memory providers: those tests are significantly more complicated, with many more services than just the get/update caching provider, and I don't think it strictly blocks #11149.

After this, the next steps towards #11149 will be:

1. implement new providers, with some sort of factory function for constructing the appropriate provider
2. expose settings in `pants.toml` to select and configure those new providers
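As a sketch of the kind of options bundling this describes (the struct and field names are illustrative assumptions, not the exact structs that landed):

```rust
use std::collections::BTreeMap;
use std::time::Duration;

/// Bundles everything needed to construct a remote cache provider, so a
/// factory function can take one value rather than a long argument list.
pub struct RemoteCacheProviderOptions {
    /// e.g. "grpcs://remote.example.com" or, later, other schemes.
    pub address: String,
    pub instance_name: Option<String>,
    /// Extra request headers, e.g. authentication tokens.
    pub headers: BTreeMap<String, String>,
    pub timeout: Duration,
    pub concurrency_limit: usize,
}
```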
@huonw I totally agree with your statement above and just realized you had already started on preparatory changes. Just wanted to share my appreciation for the efforts! @MichalOleszak FYI
…9711) This (hopefully) optimises storing large blobs to a remote cache, by streaming them directly from the file stored on disk in the "FSDB". This builds on the FSDB local store work (#18153), relying on large objects being stored as an immutable file on disk, in the cache managed by Pants.

This is an optimisation in several ways:

- Cutting out an extra temporary file:
  - Previously `Store::store_large_blob_remote` would load the whole blob from the local store and then write it to a temporary file. This was appropriate with LMDB-backed blobs.
  - With the new FSDB, there's already a file that can be used, no need for that temporary, and so the file creation and writing overhead can be eliminated.
- Reducing sync IO in async tasks, due to mmap:
  - Previously `ByteStore::store_buffered` would take that temporary file and mmap it, to be able to slice into `Bytes` more efficiently... except this is secretly blocking/sync IO, happening within async tasks (AIUI: when accessing a mmap'd byte that's only on disk, not yet in memory, the whole OS thread is blocked/descheduled while the OS pulls the relevant part of the file into memory, i.e. `tokio` can't run another task on that thread).
  - This new approach uses normal `tokio` async IO mechanisms to read the file (see the sketch below), and thus hopefully has higher concurrency.
  - (This also eliminates the unmaintained `memmap` dependency.)

I haven't benchmarked this, though. My main motivation is firming up the provider API before adding new byte store providers, for #11149. This also resolves some TODOs and even eliminates some `unsafe`, yay!

The commits are individually reviewable.

Fixes #19049, fixes #14341 (`memmap` removed), closes #17234 (solves the same problem but with an approach that wasn't possible at the time).
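A minimal sketch of the async, chunked-read pattern this describes (the chunk size and function shape are my assumptions, not the exact pants code):

```rust
use bytes::Bytes;
use tokio::fs::File;
use tokio::io::AsyncReadExt;

/// Read an immutable on-disk blob as a sequence of Bytes chunks using
/// tokio's async file IO, instead of mmap-ing the file (where a page
/// fault silently blocks the whole OS thread inside an async task).
async fn read_blob_chunks(path: &std::path::Path) -> std::io::Result<Vec<Bytes>> {
    const CHUNK_SIZE: usize = 4 * 1024 * 1024; // illustrative upload chunk size
    let mut file = File::open(path).await?;
    let mut chunks = Vec::new();
    loop {
        let mut buf = vec![0u8; CHUNK_SIZE];
        let n = file.read(&mut buf).await?;
        if n == 0 {
            break; // EOF
        }
        buf.truncate(n);
        chunks.push(Bytes::from(buf));
    }
    Ok(chunks)
}
```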
…9827) This is the final major step that unlocks #11149:

- adding new byte store and action cache providers, implemented using https://opendal.apache.org
- plumbing through to be able to use them via `remote_store_address = "..."`
- (and implementing an "example" provider, backed by the local file system)

This doesn't yet resolve that issue, as it doesn't implement any sort of useful "simple" truly-remote cache, but it sets up most of the infrastructure.

### User facing change

This only adds support for one additional store for now: `experimental:file:///path/to/dir`, which writes the store to `/path/to/dir`. This store is enough to validate that the functionality (mostly) works, and even seems like it would be useful for:

- debugging/testing remote cache behaviour (e.g. one can set up reduced reproducers for remote caching bugs using normal file manipulation commands)
- potentially for real, for using an NFS mount as a remote cache, without forcing _all_ of Pants' caches to have the NFS overhead.

The user facing functionality here is configured entirely through `remote_store_address`: that option now supports URIs that have schemes other than `grpc` or `grpcs`. To start with, just `file`. It also explicitly supports these schemes being experimental, communicated through the `experimental:` scheme-prefix. This is to, hopefully, allow landing the feature earlier and with less risk.

For example, running `pants test ::` in the following example writes various blobs to subdirectories of `/tmp/remote-cache-example` (e.g. `action-cache/e3/2f/e32f034f...` and `byte-store/50/ce/50ce4a68...`), and a second `pants --no-pantsd --no-local-cache test ::` command is serviced from that cache, reporting `//:slow-test succeeded in 5.01s (cached remotely)`:

```toml
# pants.toml
[GLOBAL]
pants_version = "2.17.0"
backend_packages = ["pants.backend.shell"]

remote_store_address = "experimental:file:///tmp/remote-cache-example"
remote_cache_read = true
remote_cache_write = true
```

```python
# BUILD
experimental_test_shell_command(name="slow-test", command="sleep 5", tools=["sleep"])
```

### Details of this change

There's three main bits of work:

1. add the new generic OpenDAL byte store and action cache providers (in the two pairs of `base_opendal.rs` + tests files)
2. hook up the specific FS one (in `store/src/remote.rs` and `remote/src/remote_cache.rs`)
3. adjust the options parsing to support them, especially the `experimental:` scheme handling (in `global_options.py`)

None of these are huge in isolation, but they add up to a fair chunk of code. I think each of those files can be reviewed somewhat in isolation, in that order.

### Why OpenDAL?

It's used by sccache for similar remote-caching purposes (https://github.com/search?q=repo%3Amozilla%2Fsccache%20opendal&type=code), and supports a whole lot of services:

- blob stores (e.g. S3, GCS)
- key-value caches (e.g. Redis)
- CI caches (e.g. GitHub Actions)
- weirder ones (IPFS, Google Drive)

Pants doesn't/shouldn't support all of them, but definitely blob stores and CI caches seem exciting!

### Next steps

This only adds support for caching "remotely" to the local file system, but I'll add support for GitHub Actions as follow-up work (#19831), and maybe S3 too. Hopefully, as part of doing that, I'll work through any authentication (etc.) issues, so it becomes easier for others to add the backends they need too.
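For a sense of why OpenDAL makes these providers cheap to add, here's a minimal sketch of a filesystem-backed operator (my own illustration of the opendal crate's public API circa this PR, not the code in `base_opendal.rs`; newer opendal releases have changed some signatures, e.g. `read` returning a `Buffer`):

```rust
use opendal::{services::Fs, Operator};

#[tokio::main]
async fn main() -> opendal::Result<()> {
    // Filesystem backend rooted at the cache directory; swapping Fs for
    // another opendal service (S3, Redis, GHA cache, ...) is the whole
    // point: the read/write calls below stay identical.
    let mut builder = Fs::default();
    builder.root("/tmp/remote-cache-example/byte-store");
    let op: Operator = Operator::new(builder)?.finish();

    // Keys like "50/ce/50ce4a68..." shard blobs by digest prefix.
    op.write("50/ce/example-digest", vec![1u8, 2, 3]).await?;
    let bytes = op.read("50/ce/example-digest").await?;
    assert_eq!(bytes, vec![1, 2, 3]);
    Ok(())
}
```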
I've suggested we use the URI scheme for deciding on the service, rather than an explicit `remote_store_service = "reapi"` option, or similar. In future, some services may use the same scheme, e.g. I imagine several services might conventionally use `http://` or `https://`. My current thinking is to solve this similarly to pip's [VCS requirements](https://pip.pypa.io/en/stable/topics/vcs-support/) (`vcs+transport://...`) or `sqlalchemy`'s [database URLs](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls) (`dialect+driver://...`), e.g. `remote_store_address = "s3+http://..."` or `remote_store_address = "http+s3://..."`, where the provider would strip off the `s3` part as appropriate.

There's also a whole lot of TODOs that I don't think are critical to landing this code, and that I can chip away at once it's in: by landing this sooner, I'm hoping we can start having the user-facing parts of #11149 tested, validated and enjoyed sooner rather than later. I've already done a loooot of *pre*-factoring (see all the PRs referencing #11149), so it'd be nice to switch to having the feature in and just do some normal *re*-factoring. 😅 TODOs are mostly tracked in #19902.
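A hypothetical sketch of the `provider+transport` parsing this suggests (nothing that exists in pants; purely illustrating the convention):

```rust
/// Split an address like "s3+http://bucket.example" into an explicit
/// provider ("s3") and the transport URL ("http://bucket.example").
/// Addresses without a '+' leave the provider to be inferred from the
/// scheme alone.
fn split_provider(address: &str) -> (Option<&str>, String) {
    match address.split_once("://") {
        Some((scheme, rest)) => match scheme.split_once('+') {
            Some((provider, transport)) => (Some(provider), format!("{transport}://{rest}")),
            None => (None, address.to_string()),
        },
        None => (None, address.to_string()),
    }
}

fn main() {
    assert_eq!(
        split_provider("s3+http://bucket.example"),
        (Some("s3"), "http://bucket.example".to_string())
    );
    assert_eq!(
        split_provider("grpcs://remote.example"),
        (None, "grpcs://remote.example".to_string())
    );
}
```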
Nice post! Here are some notes from the OpenDAL side that may be worth considering as input for you: apart from all those services' own native cache services, storage services like s3, azblob, and gcs are all already supported by OpenDAL. Please let us know if you want more 😆
This adds support for remote caching to the GitHub Actions Cache, by setting `remote_store_address = "experimental:github-actions-cache+https://..."`. This is implemented as just another OpenDAL (#19827) provider.

As written, this requires the `ACTIONS_CACHE_URL` and `ACTIONS_RUNTIME_TOKEN` values to be massaged into the appropriate pants configuration. I've included basic docs for how to do this. This set-up can/should be packaged into the `init-pants` action in future, but for now, we're in experimental-land.

There's various ways in which this can be improved, but I think it's helpful to have a "real world" example of a remote cache beyond the `file://` one to be able to guide the refactoring, plus it's good to start testing this earlier. Thus, I'd like to land this, and do the improvements as follow-up:

- better docs/help (always!), particularly on the `using-pants-in-ci` docs page, which I haven't updated here
- now that I've implemented this, I don't like my `<provider>+<scheme>://...` proposal from #19827. I think it'd be better to have a separate `remote_store_provider = "experimental-github-actions-cache"` (or `"reapi"`, etc.) option, with `remote_store_address` just being the raw URL (and the provider validating it can understand the URL, of course)
- it'd be good to be able to pass the token in a simpler way: I'm thinking there could be a literal `remote_oauth_bearer_token` option, which is expected (and suggested in docs) to be passed as the `PANTS_REMOTE_OAUTH_BEARER_TOKEN` env var, in addition to the existing `remote_oauth_bearer_token_path`

These are tracked in #19902 (and all of the others in that issue would be good too, but weren't as acutely surfaced by this change; I'll start chipping away at them once this is in).

This is awkward to test. My plan, not implemented here, is to somehow tee up a dedicated set of tests to run within https://github.com/nektos/act. For now, we'll have to go with manual testing, e.g. this specific PR is validated by https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6298908094/job/17119702612#step:11:983

This is the first example of something useful and user-facing for #11149. It doesn't close that issue; I'll wait until we have a blob store like S3 too.

The commits are individually reviewable. (Replaces #17840.)
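For orientation, a heavily hedged sketch of what wiring up OpenDAL's GitHub Actions Cache service can look like (my assumption about the opendal `Ghac` builder's API at the time; check the crate docs rather than treating this as the provider code):

```rust
use opendal::{services::Ghac, Operator};

#[tokio::main]
async fn main() -> opendal::Result<()> {
    // opendal's GitHub Actions Cache service; in the opendal versions I've
    // seen, it picks up ACTIONS_CACHE_URL and ACTIONS_RUNTIME_TOKEN from
    // the environment (an assumption -- verify against the crate docs).
    let mut builder = Ghac::default();
    builder.root("/pants-cache"); // hypothetical namespace within the cache
    let op: Operator = Operator::new(builder)?.finish();

    // Same read/write surface as every other opendal backend.
    op.write("action-cache/e3/2f/example-digest", vec![0u8; 4]).await?;
    Ok(())
}
```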