Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for OpenDAL-backed remote stores (file://, initially) #19827

Merged
merged 29 commits into from
Sep 21, 2023

Conversation

huonw
Copy link
Contributor

@huonw huonw commented Sep 13, 2023

This is the final major step that unlocks #11149:

  • adding new byte store and action cache providers, implemented using https://opendal.apache.org
  • plumbing through to be able to be use them via remote_store_address = "..."
  • (and implementing an "example" one, that's backed by the local file system)

This doesn't yet resolve that issue, as it doesn't implement any sort of useful "simple" truly-remote cache, but it sets up most of the infrastructure.

User facing change

This only adds support for one additional store for now: experimental:file:///path/to/dir, which writes the store to /path/to/dir.

This is store is enough to validate that this functionality (mostly) works, and even seems like it would be useful for:

  • debugging/testing remote cache behaviour (e.g. can set up reduced reproducers for remote caching bugs using normal file manipulation commands)
  • potentially for real, for using an NFS mount as a remote cache, without forcing all of Pants' caches to have the NFS overhead.

The user facing functionality here is configured entirely through remote_store_address: that option now supports URIs that have schemes other than grpc or grpcs. To start with, just file. It also explicitly supports these schemes being experimental, communicated through the experimental: scheme-prefix. This is to, hopefully, allow landing the feature earlier and with less risk.

For example, running pants test :: in the follow example writes various blobs to subdirectories of /tmp/remote-cache-example (e.g. action-cache/e3/2f/e32f034f... and byte-store/50/ce/50ce4a68...), and a second pants --no-pantsd --no-local-cache test :: command is able to service it from cache, reporting //:slow-test succeeded in 5.01s (cached remotely)

# pants.toml
[GLOBAL]
pants_version = "2.17.0"

backend_packages = ["pants.backend.shell"]

remote_store_address = "experimental:file:///tmp/remote-cache-example"
remote_cache_read = true
remote_cache_write = true
# BUILD
experimental_test_shell_command(name="slow-test", command="sleep 5", tools=["sleep"])

Details of this change

There's three main bits of work:

  1. add the new generic OpenDAL byte store and action cache provider (in the two pairs of base_opendal.rs + tests files)
  2. hook up the specific FS one (in store/src/remote.rs and remote/src/remote_cache.rs)
  3. adjust the options parsing to support them, especially the experimental: scheme handling (in global_options.py)

None of these are huge in isolation, but they add up to a fair chunk of code. I think each of those files can be reviewed somewhat in isolation, in that order.

Why OpenDAL?

It's used by sccache for similar remote-caching purposes (https://github.com/search?q=repo%3Amozilla%2Fsccache%20opendal&type=code), and supports a whole lot of services:

  • blob stores (e.g. S3, GCS)
  • key-value caches (e.g. Redis)
  • CI caches (e.g. GitHub Actions)
  • weirder ones (IPFS, Google Drive)

Pants doesn't/shouldn't support all of them, but definitely blob stores and CI caches seem exciting!

Next steps

This only adds support for caching "remotely" to the local file system but I'll add support for GitHub Actions as follow up work (#19831), and maybe S3 too. Hopefully as part of doing that I'll work through any authentication (etc.) issues and so it becomes easier for others to add the backends they need too.

I've suggested we use the URI scheme for deciding on the service, rather than an explicit remote_store_service = "reapi" option, or similar. In future, some services may use the same scheme, e.g. I imagine several services might conventionally use http:// or https://. My current thinking is to solve this similar to pip's VCS requirements (vcs+transport://...) or sqlalchemy database URLs (dialect+driver://...), e.g. remote_store_address = "s3+http://..." or remote_store_address = "http+s3://...", and then the provider would strip off the s3 part as appropriate.

There's also a whole lot of TODOs that I don't think are critical to landing this code, and I can chip away at once it's in: by landing this sooner, I'm hoping we can start having the user-facing parts of #11149 being tested, validated and enjoyed sooner rather than later. I've already done a loooot of pre-factoring (see all the PRs referencing #11149), so it'd be nice to switch to having the feature in and just do some normal re-factoring. 😅

TODOs mostly tracked in #19902

@huonw huonw changed the title Add support for OpenDAL-backed remote stores (file:// only, initially) Add support for OpenDAL-backed remote stores (file://, initially) Sep 13, 2023
@huonw huonw marked this pull request as ready for review September 13, 2023 11:49
@huonw huonw requested review from stuhood and thejcannon September 13, 2023 11:49
@huonw
Copy link
Contributor Author

huonw commented Sep 14, 2023

I've refreshed the GHA cache backend #17840 PR to #19831, which seems to indicate that this is going in the right direction: that PR is very small!

@stuhood
Copy link
Member

stuhood commented Sep 15, 2023

Thanks @huonw ! Sorry that I didn't get to this this week: will review by EOD Monday.

@huonw
Copy link
Contributor Author

huonw commented Sep 16, 2023

All good, no rush. It's a big one!

Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

src/python/pants/option/global_options.py Show resolved Hide resolved

pub struct Provider {
/// This is public for easier testing of the action cache provider
// TODO: move all the providers into a single crate so that the pub isn't necessary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, on the contrary: I think that it would be good to separate the grpc based provider into a separate crate from the opendal provider, so that they can be compiled independently / in-parallel.

The process_execution crate had previously gotten unwieldy due to having grpc, docker, and nailgun clients all living in it, and I think that the store crate is likely to end up in the same position with another client implementation added to it (particular one as potentially-hefty as opendal).

Copy link
Contributor Author

@huonw huonw Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, potentially a crate for with the traits and then separate providers is better.

This source comment is a bit ambiguous, sorry, but it's more referencing that it'd be nice to have the store and action cache providers for each service together in one crate, not necessarily that the gRPC/REAPI and OpenDAL ones should be grouped too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting into a bit of a distracting discussion, but separate crates for each provider might end up with 4 new crates, which is probably fine/manageable?

graph BT
   store("fs/store (existing)")
   remote("process_execution/remote (existing)")
   traits("remote_providers/traits (new)")
   grpc("remote_providers/reapi (new)")
   opendal("remote_providers/opendal (new)")
   selector("remote_providers (new)")
   grpc --> traits
   opendal --> traits
   selector --> grpc
   selector --> opendal
   store --> selector
   remote --> selector
Loading

Where the "top-level" remote_providers centralises things like the functions for choosing a provider based on options, and, potentially more of the validation code that's currently in global_options.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That layout looks good, yea.

Comment on lines +82 to +83
// TODO: we shouldn't need to gin up a whole copy of this struct; it'd be better to have the two
// set of remoting options managed together.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of potential cleanup that can happen here, yea.

Ideally we could expose a struct all the way through to Python (with a very light Python wrapper around the inner type), such that it could directly be constructed from global options. Currently that's not the case -- in particular: see ExecutionOptions, which is copied into positional args to the scheduler.

src/rust/engine/fs/store/src/remote.rs Outdated Show resolved Hide resolved
@huonw
Copy link
Contributor Author

huonw commented Sep 21, 2023

I've filed #19902 tracking the refactoring for me.

@huonw huonw merged commit de1a3d3 into main Sep 21, 2023
@huonw huonw deleted the huonw/11149-dal branch September 21, 2023 23:33
huonw added a commit that referenced this pull request Sep 27, 2023
This adds support for remote caching to the Github Actions Cache, by
setting `remote_store_address =
"experimental:github-actions-cache+https://..."`. This is implemented as
just another OpenDAL (#19827) provider.

As written, this requires the `ACTIONS_CACHE_URL` and
`ACTIONS_RUNTIME_TOKEN` values to be massaged into the appropriate pants
configuration. I've included basic docs for how to do this. This set-up
can/should be packaged into the `init-pants` action in future, but for
now, we're in experimental-land.

There's various ways in which this can be improved, but I think it's
helpful to have a "real world" example of a remote cache beyond the
`file://` one to be able to guide the refactoring, plus good to start
testing this earlier. Thus, I'd like to land this, and do them as
follow-up:

- better docs/help (always!), particularly on the `using-pants-in-ci`
docs page, which I haven't updated here
- now that I've implemented this, I don't like my
`<provider>+<scheme>://...` proposal from #19827. I think it'd be better
to have a separate `remote_store_provider =
"experimental-github-actions-cache"` (or `"reapi"`, etc.) option and
`remote_store_address` is just the raw URL (with the provider validating
it can understand the URL, of course)
- it'd be good to be able to pass the token in in a simpler way, I'm
thinking there could be a `remote_oauth_bearer_token` option literally,
which is expected (and suggested in docs) to be passed as the
`PANTS_REMOTE_OAUTH_BEARER_TOKEN` env var, in addition to
`remote_oauth_bearer_token_path`

These are tracked in #19902 (and all of the others in that issue would
be good too, but weren't as acutely surfaced by this change, I'll start
chipping away at it once this is in).

This is awkward to test. My plan, not implemented here, is to somehow
tee up a dedicated set of tests to run within
https://github.com/nektos/act somehow. For now, we'll have to go with
manual testing, e.g. this specific PR is validated by
https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6298908094/job/17119702612#step:11:983

This is the first example of something useful user-facing for #11149.
This doesn't close it, I'll wait until we have a blob store like S3 too.

The commits are individually reviewable.

(Replaces #17840.)
huonw added a commit that referenced this pull request Oct 5, 2023
This does a no-functionality-change refactoring of the remote _store_
providers (backed by REAPI and OpenDAL). This splits the concept of a
remote provider out into their own internal "surface" crate
`remote_provider`, plus three crates that are (mostly)
implementation-details of that one:

- `remote_provider/remote_provider_traits` for the traits (and structs!)
that the various providers have to implement
- `remote_provider/remote_provider_reapi` for the gRPC/REAPI-backed
provider
- `remote_provider/remote_provider_opendal` for the OpenDAL-backed
provider

They're arranged like this:

```mermaid
graph BT
   remote("process_execution/remote (existing)")
   store("fs/store (existing)")
   traits("remote_provider_traits (new)")
   grpc("remote_provider_reapi (new)")
   opendal("remote_provider_opendal (new)")
   selector("remote_provider (new)")
   grpc --> traits
   opendal --> traits
   selector --> grpc
   selector --> opendal
   remote -- for remote execution --> grpc
   remote --> selector
   store --> selector
```

Theoretically the new crates other than `remote_provider` are an
implementation detail... except there's a helper in
`remote_provider_reapi` that's used for the remote _execution_ in
`process_execution/remote`, in addition to the byte store and action
cache, hence the dependency there.


This is one point of #19902,
following up on
#19827 (comment).

The commits are individually reviewable, although the overall PR view
gives a slightly more useful view of the overall file renames for some
files.
huonw added a commit that referenced this pull request Jan 15, 2024
This replaces the over-complicated address URI scheme-based look-up from
#19827 for deciding on which provider to use with just a normal option
`[GLOBAL].remote_provider`. This defaults to `reapi`, to preserve the
current (stable) behaviour.

That is, now, one can set:

```toml
[GLOBAL]
remote_provider = "experimental-file"
remote_store_address = "file:///..."
```

Or similar.

Another item in #19694 

The commits are somewhat usefully broken up for review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants