Persistent File Storage in k8s environments #3056

Closed
3 of 8 tasks
Tracked by #3942
knolleary opened this issue Nov 6, 2023 · 24 comments
@knolleary
Member

knolleary commented Nov 6, 2023

Description

We introduced the File Nodes and the File Server as a workaround to the fact that our NR instances do not have a persistent file system. This allowed us to provide 'working' File Nodes that were familiar to existing NR users; however, they have some significant drawbacks.

  • Only the File Read/Write nodes have been reworked to use the File Server. There are other 3rd party nodes that expect file system access that we cannot reasonably modify to work with our File Server. For example, the sqlite node provides a super convenient way to store queryable data locally - but the db file has to be on the local disk.
  • As usage grows, our File Server component becomes an increasingly critical part of the architecture which brings its own set of scalability challenges.

Following a number of discussions on this topic, we want to revisit the original decision not to attach persistent storage volumes to our cloud-hosted Node-RED instances.

This is only scoped to the k8s driver in the first instance. Docker will require a different approach and LocalFS already has local file system access.

The goal will be for each instance to have a volume attached with the appropriate space quota applied.

Open questions:

  • How to migrate the files generated by the existing file nodes to the new per-instance storage
    • Or do we keep that 'legacy' mode and only new instances gain the persistent storage option?
  • How does it play with HA mode: does each HA copy get the same FS mounted, or are they distinct and separate file systems? May depend on what k8s allows us to do...

User Value

Prior to this, interaction with cloud-based file storage was only possible using our own custom File nodes. This work will allow any node (e.g. ui builder, sqlite) to have file persistence when running on FlowFuse Cloud.


@knolleary knolleary added the epic A significant feature or piece of work that doesn't easily fit into a single release label Nov 6, 2023
@MarianRaphael
Contributor

See also: #1779

@MarianRaphael MarianRaphael added the consideration A potential feature or improvement that is under review for possible development and implementation label Nov 27, 2023
@knolleary knolleary added the sales request requested by a sales lead label May 8, 2024
@joepavitt joepavitt moved this from Short to Next in ☁️ Product Planning May 8, 2024
@joepavitt joepavitt added this to the 2.5 milestone May 8, 2024
@hardillb
Contributor

@joepavitt should this now be on the dev board so it can be in the design stage?

@joepavitt
Contributor

Thanks for checking @hardillb

@hardillb
Contributor

hardillb commented May 15, 2024

Assumptions:

  • Each Instance gets its own space (no longer team scoped)
  • Any quota is also on a per instance basis
  • This should be K8s hosting agnostic, e.g. should work on any K8s install not just AWS EKS

Questions:

  • Where to mount the volume? (/data would be best, but has problems; see below)
  • When do we delete data? (storage would still be billed even for a suspended instance)

Research required:

  • What StorageClasses are provided by AWS/3rd party
  • A comparison of prices per GB vs limitations on the type of storage (AWS pricing on storage is multidimensional)
  • Check where the sqlite node stores an unqualified db file.

Mounting any volume on /data (the userDir) would mean that node_modules would persist across restarts. Persisting installed nodes would decrease start-up time, but it would cause problems when stack changes happen, as a stack change could change the Node.js version and require a rebuild of any native components.

We would also want the mount point to be the current working directory of the Node-RED process, so that any files created without a fully qualified path end up in the mounted volume. The core Node-RED File nodes have an option that can be set in settings.js to control this, but I don't think any 3rd party nodes honour that setting.
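
For reference, the settings.js option referred to here appears to be fileWorkingDirectory - a minimal sketch, with an illustrative path:

```javascript
// settings.js - Node-RED runtime settings (sketch; the path is illustrative)
module.exports = {
    // The core File In/File Out nodes resolve relative paths against this
    // directory. Most 3rd party nodes ignore it and use process.cwd() instead,
    // which is why the mount point also needs to be the working directory.
    fileWorkingDirectory: "/data/storage"
    // ...other settings...
};
```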

AWS Storage options:

  • EBS (https://aws.amazon.com/ebs/) - block based, needs a file system on top (but the K8s provisioner will format it on creation)
  • EFS (https://aws.amazon.com/efs/) - self-scaling, filesystem based
  • FSx (https://aws.amazon.com/fsx) - not sure this would work for what we want; it looks more like a SAN in the cloud
  • S3 (https://aws.amazon.com/s3) - object storage; while it can look like a filesystem, I don't think this is what we want

@Steve-Mcl
Contributor

Would having a filesystem (automatically) permit virtual memory and thus improve the memory issues/crashes witnessed while installing?

Mounting any volume on /data (the userDir) would mean that node_modules would persist across restarts. This would mean that installed nodes would be persistent increasing start up time

I thought having persistent FS would decrease start up time? (typo?)

@hardillb
Contributor

hardillb commented May 15, 2024

Would having a filesystem (automatically) permit virtual memory and thus improve the memory issues/crashes witnessed while installing?

Not possible - you can't add swap space inside a container

I thought having persistent FS would decrease start up time? (typo?)

Yes typo

@Steve-Mcl

This comment was marked as off-topic.

@hardillb

This comment was marked as off-topic.

@hardillb
Contributor

hardillb commented May 17, 2024

Also need to decide on the quota implementation. It looks like the smallest increment we can mount on AWS is 1GB.

Need to know where this will sit in the Team/Enterprise levels and what happens on migrations between levels in FFC (given the work Nick has had to do for instance sizes being unavailable at higher levels)

@ppawlowski
Contributor

ppawlowski commented May 20, 2024

We should approach this topic from two perspectives - core app in general and FFC on EKS.

For the first one - the core app (or probably the k8s driver) should use a dynamic storage provisioning approach: create a Persistent Volume Claim based on the provided storage class configuration and use it in the deployment definition.
As a software provider, we cannot anticipate every possible storage class. As stated in the linked documentation, the cluster administrator is responsible for creating a storage class that meets the requirements. The name of the storage class should be passed to the application as a configuration parameter.
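
As an illustration, a minimal sketch of the kind of PVC the driver might generate - the names here are placeholders, not the actual driver output:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: instance-example-storage             # placeholder - one claim per instance
spec:
  accessModes:
    - ReadWriteMany                          # needed if HA replicas are to share the volume
  storageClassName: flowfuse-file-storage    # created by the cluster admin, passed in as config
  resources:
    requests:
      storage: 1Gi                           # example quota
```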

From the FFC perspective - once the above is implemented, we are limited to EBS and EFS. We should aim to use EFS rather than EBS, for the following reasons:

  • unlike EFS, EBS is AWS-zone scoped (affects High Availability)
  • unlike EFS, EBS cannot be mounted in multiple pods with the ReadWriteMany option (affects High Availability)
  • the number of EBS volumes which can be mounted on a single EC2 instance is limited
  • with EBS we pay for the provisioned disk size, with EFS - for actual usage
  • resizing an EBS volume is possible but requires manual intervention

With all the above in mind, EFS should be our first choice. However, EFS is backed by the NFS protocol, and my main concern is its performance. Before making any production-ready decisions I would suggest a solid PoC first.

Although using AWS S3 as EKS storage is possible via a dedicated CSI driver, we should avoid it since it does not support dynamic provisioning.

References:
https://zesty.co/blog/ebs-vs-efs-which-is-right/
https://www.justaftermidnight247.com/insights/ebs-efs-and-s3-when-to-use-awss-three-storage-solutions/

@knolleary
Member Author

knolleary commented May 20, 2024

Summary of discussion between @hardillb and myself:

  1. This option will only be available to AWS hosted instances using the k8s driver - eg FFC and FFDedicated
    • Self-hosted k8s users will require design work on what storage services can be used. Out of scope for the first iteration.
  2. EFS is the first choice of backend for this.
  3. Need to clarify some of the limits EFS applies to ensure it's a solution we can scale (see below).
  4. The volume will be mounted as /data/storage and we'll update nr-launcher to use that as the working directory of the NR process (see the sketch after this list).
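
For illustration only, point 4 translates into something along these lines in the instance pod spec (placeholder names; the working-directory change itself lives in nr-launcher):

```yaml
# Fragment of an instance pod spec - illustrative names only
spec:
  containers:
    - name: node-red
      volumeMounts:
        - name: persistent-storage
          mountPath: /data/storage            # nr-launcher will use this as the NR working directory
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: instance-example-storage   # placeholder claim name
```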

Migration from existing file store

Exact details TBD, but one option will be for nr-launcher to copy files back from the file-store before Node-RED starts up for the first time with the new storage option.

We will identify the current list of instances actively using the file store and assess the scale of migration needed. It may be that we can apply something more manual at a small scale - although we need to consider self-hosted customers who choose to adopt this.

Availability

We already provide a storage quota per team type - but that is limited to our File Nodes and has limited uptake (will get exact numbers to back this assertion up)

We have two options:

  • Carry over existing storage quotas per team type - all team types get something.
  • Restrict fully persistent storage to the higher tiers

Ultimately this will be a choice we can make further down the implementation as it will be a final stage configuration to apply to the platform.

Open questions

The following items need some additional research to ensure we have a scalable solution.

The EFS limits are documented as:

  • Limit of 1000 separate EFS volumes per AWS account
  • Limit of 120 ‘access points’ per volume
  • Limit of 400 mount points per network chunk (VPC) - unsure if that’s nodes or pods

We provide each instance its storage via an access point on a volume, and each EFS volume can accommodate 120 access points - thus we'll have capacity for 120k instances. The volume limit is also one that can be increased on request. We'll need a way to manage the mapping of instance to volume to ensure utilisation.

What is not currently clear is the mount points per VPC limit: does that apply to the underlying nodes or to the pods (eg individual NR instances)? That is an order-of-magnitude difference - and if it's the latter, we're already beyond that limit. @hardillb is following up on this via AWS support forums.

@knolleary
Member Author

Clarifications on the EFS limits:

  • 1000 access points per filesystem, 1000 filesystems per AWS account - so in total 1,000,000 instances possible
    • will need to figure out how to shard instances across filesystems automatically and manage the distribution
  • 1400 mount targets per AWS account; these are how the EFS filesystem is exposed to the VPC. We should only need 3 (one for each AZ)

ref: https://repost.aws/questions/QUOS-IQj4pSa2TZ2YouHe_AA/efs-limits-in-and-eks-environment-total-number-of-volumes-access-points-mount-points

@joepavitt joepavitt moved this from Next to Started in ☁️ Product Planning May 22, 2024
@joepavitt joepavitt removed this from the 2.5 milestone May 28, 2024
@joepavitt joepavitt added the headline Something to highlight in the release label May 29, 2024
@joepavitt joepavitt added this to the 2.6 milestone May 29, 2024
@hardillb
Contributor

Looking at what will be needed for AWS EFS with AccessPoints, I think we will need 2 separate storage solutions:

  • A basic K8s version that will just generate a PVC (Persistent Volume Claim) against a Storage Class that will auto-provision a volume
  • An AWS EFS version that will take an EFS filesystem as a config option, create an AccessPoint on that EFS, and then create a PVC specifically for that AccessPoint (see the sketch below).
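
For context, the second flavour maps onto the aws-efs-csi-driver's dynamic provisioning mode, where a StorageClass carries the filesystem ID and the driver creates an AccessPoint per PVC - roughly along these lines (the fileSystemId is a placeholder):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: flowfuse-efs                      # placeholder name
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap                # driver creates one AccessPoint per PVC
  fileSystemId: fs-0123456789abcdef0      # placeholder - the EFS filesystem passed as a config option
  directoryPerms: "700"
```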

@joepavitt
Contributor

@hardillb are you able to provide a rough delivery date for this please?

@hardillb
Contributor

Assuming testing today goes well, the technical parts are pretty much done, with the exception of how to enforce the quota requirements.

At this time I have no idea how long that will take.

@joepavitt
Contributor

Any updates please @hardillb? Release is next week, and marketing is asking whether this highlight will be delivered

@hardillb
Contributor

The code changes are up for review

We need to:

  • install the EFS driver into the staging and production clusters
  • decide what we are doing with the quota (because we don't have anything yet)
  • need clear guidance on who/where this will be available to on FF Cloud

@joepavitt
Contributor

Okay, and when will we answer those questions, who is responsible for answering/actioning?

@hardillb
Contributor

I'll get with @ppawlowski tomorrow to install the EFS driver so it's ready

The question on access was asked higher up and then left. It's a product call whether we make this only available to higher tiers, but the old file storage is currently available to all tiers, just with different quota sizes.

The fact we don't have a quota solution for this at the moment may impact the last point.

@joepavitt
Contributor

@hardillb status update please - ready to go for tomorrow?

@hardillb
Contributor

hardillb commented Jul 3, 2024

@joepavitt should be, I need the following reviewing/merging:

I'm finishing off the last of the environment prep at the moment.

@joepavitt
Contributor

@hardillb assuming we can close this out now?

@knolleary
Member Author

The core feature has been delivered - so yes, I think we can close this off.

There are some residual tasks to complete which we should get raised separately.
