Run NFS servers in-cluster #50

Closed · Fixed by #613

yuvipanda opened this issue Oct 8, 2020 · 6 comments

yuvipanda (Member) commented Oct 8, 2020

Description

We currently run a separate, hand-rolled VM for NFS. Instead we should run an in-cluster NFS server - most likely one per cluster (for overprovisioning reasons).

I'm slightly concerned here, since the NFS server node going down means all the hubs are out. But that's also true for the proxy, nginx-ingress & other pods, so it's probably something we should be ok with.

Benefit

Our current setup (separate VMs for NFS) is a single point of failure, not repeatably built, and a bit icky. It also runs a VM full-time with low resource utilization.

This change would make it easier to just set up a cluster and go, and would make our whole setup a lot more repeatable.

This will also let us add features we have wanted for a while:

  1. Per-user storage quotas, probably with XFS quotas
  2. Automated snapshots, with VolumeSnapshots (rough sketch below)
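For the snapshots, the kind of object we'd be creating is roughly this (just a sketch - the names and snapshot class are made up, and the apiVersion depends on which snapshot CRDs the cluster has installed):

```bash
# Sketch: an on-demand snapshot of the NFS server's backing PVC.
# Names are placeholders; older clusters use snapshot.storage.k8s.io/v1beta1.
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: home-nfs-backup
spec:
  volumeSnapshotClassName: csi-default
  source:
    persistentVolumeClaimName: nfs-server-backing-disk
EOF
```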

Implementation details

We should watch out for accidental deletion - maybe make sure the PV isn't deleted when the PVC is?
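One way to guard against that (a sketch, with a made-up PV name) would be to set the backing PV's reclaim policy to Retain, so deleting the PVC doesn't delete the underlying volume:

```bash
# Keep the backing volume around even if its PVC is deleted.
# "home-nfs-pv" is a placeholder; substitute the real PV name.
kubectl patch pv home-nfs-pv \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Confirm the change took effect.
kubectl get pv home-nfs-pv -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
```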

I'd like to use nfs-ganesha for this, so I don't have to run a privileged container for nfs-kernel-server. It seems to get wide enough use.

Tasks to complete

choldgraf added the Enhancement label on Oct 16, 2020
ianabc (Contributor) commented Oct 20, 2020

I started mucking around with this and it seems to work. I'm using nfs-ganesha-server-and-external-provisioner, and the only issue I ran into was that their image is wrong or misconfigured (it returns a 403). nfs-ganesha-server-and-external-provisioner#6 suggests reverting to the previous image, and that seems to work; it lets me make PVCs.

I have it deployed on AWS using a storageClass with fstype: xfs, backed by a 10G PVC.
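Roughly, the backing StorageClass + PVC look like this (a sketch - the names and the in-tree EBS provisioner here are illustrative, not necessarily the exact manifests I used):

```bash
# Sketch of the backing StorageClass + PVC for the NFS server's disk.
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-backing-xfs
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: xfs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-server-backing-disk
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs-backing-xfs
  resources:
    requests:
      storage: 10Gi
EOF
```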

yuvipanda (Member, Author) commented:
For AWS, we should just use EFS. It's managed, performance is acceptable, it doesn't cost anything if we don't use it, and it's well supported. It doesn't support per-user quotas, but that's ok for now.

Google Filestore has a minimum disk commitment of 1TB, which is unfortunately pretty expensive.

ianabc (Contributor) commented May 18, 2021

I have had one of these running with EFS for a little while with efs-provisioner. If it is useful, I have some Terraform to add the EFS mount points etc.

yuvipanda (Member, Author) commented:
I think on AWS, EFS is the way to go and is a solved problem for the most part.

On GCP, we currently do the following (a rough shell sketch of these steps follows the list):

  1. Make a tiny VM (often even an f1 - we aren't heavy NFS users)
  2. Attach a disk to it, often starting at 50GB or so. We've just been using standard disks - pretty low performance, but it seems ok enough for us.
  3. Format it as XFS - with prjquota set - and mount it on the VM. Theoretically this will help us do per-user quotas in the future, although we have not used that at all so far.
  4. Make sure there's an entry in /etc/fstab so it automatically mounts in the future. We should mount by UUID, but we currently do not. We mount it under /exports/home-01, and create a directory homes under it to contain the home directories.
  5. Install the appropriate apt packages - nfs-kernel-server, nfs-common and xfsprogs.
  6. Add an appropriate entry to /etc/exports to expose the disk. You can find the options we use here. The most important ones are anonuid and anongid - all external reads / writes to the share are counted as coming from uid 1000. That's the uid we use in our containers, so this simplifies our setup a lot. We can't do this on EFS though, so perhaps we should unify and not specify this here either? idk
  7. Run exportfs -ra to make the contents of /etc/exports take effect.
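Put together, those steps amount to roughly this (a sketch - the device path and export options are illustrative rather than the exact values we use):

```bash
# Rough sketch of the manual GCP NFS setup above. The device path and
# export options are illustrative, not the exact values we use.
apt-get install -y nfs-kernel-server nfs-common xfsprogs

mkfs.xfs /dev/sdb                       # format the attached disk as XFS
mkdir -p /exports/home-01

# fstab entry so the disk mounts on reboot (ideally by UUID, which we don't do yet)
echo '/dev/sdb /exports/home-01 xfs defaults,prjquota 0 0' >> /etc/fstab
mount /exports/home-01
mkdir -p /exports/home-01/homes         # holds the user home directories

# Export the share, squashing all access to uid/gid 1000
echo '/exports/home-01 *(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)' >> /etc/exports
exportfs -ra
```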

We currently have something like this running on all our GCP clusters. They're all brittle, hand-maintained, and subtly different, I'm sure. On our Azure cluster, we have this ansible playbook to use - but attempts to use it with gcloud compute ssh have failed.

yuvipanda (Member, Author) commented:
We have a few options on how to do this.

  1. Use this dynamic NFS Server provisioner, creating NFS shares on-demand. Right now, we create a static PVC that points to the manually set up NFS server. Instead, we'll just point to a PVC dynamically provisioned by this.
  2. Run nfs-ganesha ourselves, as a statefulset with 1 replica. This is what (1) uses internally, but we can choose to forgo the complexity and run it ourselves.
  3. Run nfs-kernel-server ourselves, as a statefulset with 1 replica. This is what we do right now, and we could port it directly. However, this requires running the pod as privileged, which is a bit of a security risk.

My current intent is to go with (1)
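With (1), hub home directory storage would then just be a PVC against whatever StorageClass the provisioner registers, roughly like this (a sketch - the class name "nfs" is an assumption and depends on how the chart is configured):

```bash
# Sketch: a PVC served by the dynamic NFS provisioner. The StorageClass
# name "nfs" is an assumption - it depends on the chart's configuration.
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-directories
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs
  resources:
    requests:
      storage: 100Gi
EOF
```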

damianavila (Contributor) commented:
> My current intent is to go with (1)

After reading a little bit about the 3 options, that seems a sensible choice, IMHO.
