Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the base of CSI staging_target_paths and target_paths configurable #13263

Closed
ejweber opened this issue Jun 6, 2022 · 9 comments · Fixed by #13919
Closed

Make the base of CSI staging_target_paths and target_paths configurable #13263

ejweber opened this issue Jun 6, 2022 · 9 comments · Fixed by #13919
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/storage type/enhancement
Milestone

Comments

@ejweber
Copy link
Contributor

ejweber commented Jun 6, 2022

In the linked PR, @tgross effectively ensured that containers running CSI plugins would receive staging_target_paths like /local/csi/staging. The work was aimed at making giving multiple CSI plugins unique locations outside their containers to stage mounts at (while inside their containers, each plugin would see an isolated /local/csi/staging directory). The assumption (I think) was that CSI plugins should not care what path a mount is staged at.

In the below comment, @tgross mentioned he would eventually make this /local/csi path configurable in the csi_plugin jobspec. Though there was some discussion on whether or not there was even a use case for this, it was implied that the work would be done.

Note to reviewers: this PR is already huge and adding to this field to the jobspec is both a minor change and a lot of code to review. I'm going to do it in a follow-on PR.

Originally posted by @tgross in #12078 (comment)

For better or worse, the BeeGFS CSI driver does need this change to work with the GA Nomad CSI implementation. (We were originally able to run in the beta implementation.) For a variety of reasons, the BeeGFS CSI driver container image does not ship with the mount utility or certain other required userspace utilities (e.g. beegfs-ctl). Instead, we use a bind mount and a chroot to execute the versions of these utilities that are already installed on a running node. This works fine in Kubernetes, where staging_target_paths are reasonably predictable and provided as absolute paths that resolve in the node's mount namespace (e.g. /var/lib/kubelet/plugins/kubernetes.io/csi/pv/). A simple bind mount like /var/lib/kubelet/plugins/kubernetes.io/csi/pv/ : /var/lib/kubelet/plugins/kubernetes.io/csi/pv/ ensures the plugin code and the chrooted utilities have the same view of the mount location.

Because Nomad provides staging_target_paths like /local/csi/staging/..., a similar solution is not possible. We would like to be able to configure Nomad as @tgross suggested. For example, if Nomad would (by default) create the bind mount /opt/nomad/opt/nomad/client/csi/monolith/beegfs-plugin : /local/csi (as it does today), we would instead like it to create the bind mount /opt/nomad/opt/nomad/client/csi/monolith/beegfs-plugin : /opt/nomad/opt/nomad/client/csi/monolith/beegfs-plugin and provide staging_target_paths like /opt/nomad/opt/nomad/client/csi/monolith/beegfs-plugin/staging/...

Is the referenced jobspec changed still planned?

@tgross
Copy link
Member

tgross commented Jun 6, 2022

Hi @ejweber! Thanks for opening this issue. I'm happy to see CSI plugin authors taking a look at Nomad. So just for my understanding of the problem, the plugin is in compliance with the spec because it accepts the staging_target_path given to it in the NodeStageVolumeRequest, but it's trying to use a predictable path for purposes of finding the chroot for the mount tools? If we were to implement this, wouldn't you need to prepare the chroot inside the Nomad client directory before the plugins are even placed? (And maybe that's acceptable to your users, but just want to make sure I understand.)

Is the referenced jobspec changed still planned?

It looks like I never opened a new issue to track that, as we just didn't think it would be something anyone cared about. Mea cupla. We can treat this issue as the feature request and try to get it roadmapped. It's not a huge fix by any means; we'll just have to juggle it in with the other priorities. We'd also be thrilled to review a patch for it if you're interested!

@ejweber
Copy link
Contributor Author

ejweber commented Jun 7, 2022

We mount the root file system read-only to the plugin container like /:/host and link a small number of named binaries (beegfs-ctl, mount, etc.) to a lightweight Go application in the plugin container's path that we have dubbed chwrap. When one of the named binaries is called, chwrap is actually executed. chwrap looks for the named binary in predictable locations in /host. If chwrap finds the binary, it executes it chrooted to /host. For many intents and purposes (and critically, for dynamic library linking), the binary executes as if it was executing directly on the host. This solution is not ideal (as it breaks some of the tenants of container isolation), but it is currently necessary for us.

The requested change comes into play because code "inside" the container and the utilities "outside" the container need to see the same staging_target_path and target_path. If we know ahead of time that a staging_target_path will be like /opt/nomad/client/csi/monolith/beegfs-plugin/... (as we do for Kubernetes), we can create a bind mount like /opt/nomad/client/csi/monolith/beegfs-plugin/ : /opt/nomad/client/csi/monolith/beegfs-plugin/. The code executing "inside" the container works on /opt/nomad/client/csi/monolith/beegfs-plugin/..., and beegfs-ctl and/or mount effectively operate on /host/opt/nomad/client/csi/monolith/beegfs-plugin/... through the chwrap functionality.

We recognize that this is a bit of a strange use case. If all userspace utilities shipped in the container image (as is generally expected), we would be able to use the Nomad CSI functionality as currently implemented. However, while trying to understand what changed between our Nomad CSI beta implementation and Nomad CSI GA, we came across the comment suggesting the very change we need to get working in the way we already do in Kubernetes. If it is implemented, we have a straightforward path to supporting Nomad "again" (we have some example manifests that no longer work).

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Jun 17, 2022
@ejweber
Copy link
Contributor Author

ejweber commented Jul 25, 2022

We'd also be thrilled to review a patch for it if you're interested!

Apologies for the delay. I've been quite interested in contributing a patch that moves this issue along, but I needed to navigate my company's process for approving Open Source contributions. I submitted the above PR (#13919) to that end. As its my first contribution to Nomad, I suspect it will need some work.

@apollo13
Copy link
Contributor

Hi, not trying to derail this issue, but given the hoops you have to go through to make this csi driver work (chroot etc) I wonder if it might not be an option for you (and maybe even easier) to run with the raw_exec driver instead of docker? While raw_exec has the sound of being dangerous, nomad drivers can be limited to namespaces, so you can restrict them to operators only.

@ejweber
Copy link
Contributor Author

ejweber commented Jul 27, 2022

Hello @apollo13,

This is an interesting thought that we haven't considered. On the surface at least, it sounds like this approach would make it easy to run the driver binary without the added complexity associated with paths and containerization.

However, there are a couple of things that make me less inclined to take this approach.

  1. Our driver is primarily designed for and tested in Kubernetes, though we see a lot of potential benefit in supporting Nomad as well. (Who knows, maybe one day Nomad will be the primary use case.) There is no raw_exec equivalent in Kubernetes, and the standard Kubernetes CSI mechanisms make use of containerized drivers. We have already solved the problems around containerization in the Kubernetes environment, and it is pretty easy to port this solution to Nomad as well (through manifests alone) assuming the basic requirements I outlined earlier in the issue are met.
  2. We don't currently have a mechanism for downloading a driver binary itself (we host prebuilt containers on Docker Hub). We could likely host prebuilt binaries in GitHub releases, etc., but we'd prefer to maintain our single distribution mechanism unless this becomes impossible.

If this issue is ultimately closed without the changes we are hoping for, we may certainly revisit other options (and the one you are suggesting can now be added to the list). However, given that @tgross was already planning something to this effect and given that the issue is in stage/accepted, we're hopeful we've already identified the path forward.

@apollo13
Copy link
Contributor

Hi @ejweber,

There is no raw_exec equivalent in Kubernetes, and the standard Kubernetes CSI mechanisms make use of containerized drivers.

Sure, kubernetes wants everything to be containerized :D I mainly suggested raw_exec because I think it would be easier. I fully understand that k8s might have won the war and we might end up with docker containers as package format shoved down our throat anyways…

While I understand that you are hesitant to go down a second road when you already have something working I personally prefer simplicity in all things. In that sense I do not think that chroot and chwrap make things easier to debug when they fail.

Regarding your descriptions here, I am left with one question though: Even if you can control target_path inside the container you do not know it's path outside the container necessarily -- I'd consider the paths outside an implementation detail. Do you plan to rely on the fact that it is most likely not going to change often and just recommend to users to figure out where that path is and then set that? Seems a bit fragile to me but I guess it will probably work in practice.

Out of curiosity; is there a deployment example for beegfs (the fs itself, not the csi) that runs inside nomad?

@tgross tgross self-assigned this Aug 1, 2022
@tgross tgross added this to the 1.3.3 milestone Aug 2, 2022
@tgross
Copy link
Member

tgross commented Aug 2, 2022

#13919 has been merged and will ship in the next planned patch version of Nomad! Thanks @ejweber!

@ejweber
Copy link
Contributor Author

ejweber commented Aug 3, 2022

Regarding your descriptions here, I am left with one question though: Even if you can control target_path inside the container you do not know it's path outside the container necessarily -- I'd consider the paths outside an implementation detail. Do you plan to rely on the fact that it is most likely not going to change often and just recommend to users to figure out where that path is and then set that? Seems a bit fragile to me but I guess it will probably work in practice.

You're right that it's a bit brittle (as we had to deal with in this issue). We considered implementing some large(ish) changes in the driver itself with some configuration flags that would skirt the issue by doing some path transformations between internal and external paths, but this would also involve user input (if the external paths they used differed from the defaults we expected). Our experience with Kubernetes is that every distribution has used the same external paths going back many versions and we generally expect the same with Nomad now that CSI has gone GA. We don't expect most users to have to set the stage_publish_base_dir field, as the base manifests we provide will already it preconfigured for most clusters.

Hopefully we will eventually be able to release a larger container that simply has the utilities we need in the future (the decision to go distroless is a policy one), but raw_exec is certainly interesting. I plan to give it a shot when I can as an alternative approach.

I haven't ever deployed BeeGFS on Nomad (the BeeGFS CSI driver presumes an existing BeeGFS file system, and we have many such file systems set up in our test environment already), but it should be relatively straightforward to do depending on the use case. I'm not sure exactly what would be gained by doing it, as I tend to see Nomad jobs as being somewhat ephemeral and BeeGFS file systems as being largely permanent. However, if BeeGFS was installed on all Nomad nodes and a job wanted to set up an ephemeral file system using node-local storage targets (maybe the nodes have unused NVMe drives or something like that), it could be interesting. Or if an administrator was all in on Nomad and just wanted to use Nomad to manage all aspects of their infrastructure (we definitely see this with some Kubernetes distributions/uses), it might be something they'd want.

@github-actions
Copy link

github-actions bot commented Dec 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/storage type/enhancement
Projects
Development

Successfully merging a pull request may close this issue.

3 participants