Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ship nfs-ganesha V6 in main and squid #2246

Open
ktdreyer opened this issue Oct 2, 2024 · 13 comments
Open

ship nfs-ganesha V6 in main and squid #2246

ktdreyer opened this issue Oct 2, 2024 · 13 comments
Assignees

Comments

@ktdreyer
Copy link
Member

ktdreyer commented Oct 2, 2024

nfs-ganesha V6.0 had issues in the squid container, referenced at #2244

Would you please ship the latest V6 version that has the fixes we need?

@ktdreyer
Copy link
Member Author

ktdreyer commented Oct 2, 2024

CC @kalebskeithley

@ffilz
Copy link

ffilz commented Oct 2, 2024

So what exactly is the failure? Log messages?

@dmick
Copy link
Member

dmick commented Oct 2, 2024

I see the log message in rook/rook#14771

@dmick
Copy link
Member

dmick commented Oct 2, 2024

The suggested fix, as I understand it, was to add a configuration setting to ignore failure of the system call that could be used by users of nfs-ganesha. Again, as I understand it, this means adding something to the nfs-ganesha deploy for ceph.

I proposed another possible fix for the issue, which was to add parameters to the invocation of the nfs-ganesha container to allow the system call to penetrate the seccomp jail.

Both of these seem to require further changes to Ceph nfs-ganesha deployment before we can integrate v6.x. This leads me to further believe that there's been no testing of v6.x on containerized Ceph (by far the majority deployment strategy for Ceph upstream and the only one for downstream).

@ffilz
Copy link

ffilz commented Oct 2, 2024

It has been years since Ganesha has got meaningful testing from various adopters prior to a release tag, so it is not at all surprising that V6.0 has problems.

This issue has been identified. One fix went into downstream that was accepted for upstream, but before it was merged upstream, another adopter also hit the problem and proposed a different fix.

What has ultimately gone upstream into V6.1 is a fix agreed on by participants in a Ganesha community call last week. This fix does require a config change when operating in a non-privileged container. That config change allows Ganesha to try the prctl, and if it fails with EPERM, to issue a warning message but proceed to start.

The prctl call was added in Ganesha V6-dev.19.

The prctl call is necessary to prevent a kernel deadlock if there is a local NFS mount using Ganesha as the server. The deadlock can occur if pages need to be flushed via NFS to allow Ganesha to allocate pages to proceed. Since Ganesha is blocked waiting for free pages, it can not serve the request, via the NFS client, to flush the pages necessary to be able to allocate pages.

One thing that isn't clear is if Ganesha running in a container while some other container on the host has an NFS mount, or even a non-containerized NFS mount could also trigger the problem. That may well be the case.

If the prctl is required for containerized Ganesha, then either prctl needs to be allowed for that container, or maybe there's some way to do the prctl for the container as a whole.

In any case, setting the parameter to true will allow Ganesha to proceed without prctl success, and it will be in the same boat as it was pre-V6-dev.19.

@dmick
Copy link
Member

dmick commented Oct 2, 2024

Yep. So the new version will require a change in ceph deployment code before it can even run, and it's untested-in-Ceph code beyond that deployment change. (To be clear I'm talking about at least cephadm/ceph orchestrator; I don't even know if we currently support deploying on non-cephadm deploys.) This PR cannot be accepted until those changes, and perhaps at least some kind of testing, are complete.

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@ktdreyer
Copy link
Member Author

Hi @guits

@dmick
Copy link
Member

dmick commented Oct 25, 2024

@ktdreyer As far as I know it's still the case that we have to change something to add allow_set_io_flusher_fail=true to the nfs-ganesha config /etc/ganesha/ganesha.conf. I'm not exactly sure what agent is going to have to change to set this, probably cephadm, which already knows about ganesha.conf, but without it, the service will fail to start in a container because it can't do a prctl(PR_SET_IO_FLUSHER).

@dmick
Copy link
Member

dmick commented Oct 25, 2024

...and yes, of course, for CI builds we'll also need to change ceph.git's container/Containerfile

@ffilz
Copy link

ffilz commented Oct 26, 2024 via email

@ktdreyer
Copy link
Member Author

@ffilz , would you please open a ticket at https://tracker.ceph.com/ so that the cephadm developers understand what changes to make in Ceph's installer?

(Here's an example ticket you can use for inspiration: https://tracker.ceph.com/issues/65144 , it was under "Project: Orchestrator")

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants