Question: Slurm Accounting migration between ParallelCluster versions #6214
Comments
+1 |
Hi @christianversloot, thanks for reaching out and letting us know about your use case. In general, you can configure the cluster to use a database with an arbitrary name via the configuration Scheduling/SlurmSettings/Database/DatabaseName, introduced in ParallelCluster v3.8.0. However, as of ParallelCluster v3.9.1 your upgrade use case is not supported because:
Some follow-up questions to learn more about your use case:
Thank you. |
Hi @gmarciani thanks for your response! To answer your questions:
|
Additionally, even though I know that this is not the responsibility of this repository - the availability of |
Thanks for all the valuable information about your use case. We are still waiting for the missing info about the maximum acceptable downtime. Regarding ParallelCluster UI, I suggest creating an issue at https://github.com/aws/aws-parallelcluster-ui/issues |
Thanks, created the request: aws/aws-parallelcluster-ui#329 |
Thank you! From the issue aws/aws-parallelcluster-ui#329 it seems that you're planning to use the DatabaseName property to manage the upgrade as soon as it is available in PCUI. Just to verify that we are on the same page: this should not be done until we provide support for an external SlurmDBD in ParallelCluster, which is planned for future releases. |
Yes, understood. |
Hi @gmarciani - we had a discussion within the team and came to these downtime allowances:
Fortunately, as we've thoroughly documented upgrading a cluster between ParallelCluster (and UI) versions, I expect we should be able to stay < 1 to 1.5 hours most of the time. In other words, stopping the compute fleet in the old cluster and then spinning up a new cluster is OK for us. It would be best if both clusters could be hosted within the same database cluster, either by setting things up differently (separating database creation from cluster creation) or by allowing two clusters with the same name to co-exist (we don't want to delete the old cluster before setting up the new one). |
We are in the process of setting up a cluster with AWS ParallelCluster and AWS ParallelCluster UI. We are also working on writing a plan for upgrading the cluster. Given our knowledge and what we've learned online, doing so (in the case of new ParallelCluster versions) would require us to:
The cluster has Slurm Accounting set up. We use a separately deployed Aurora-based RDS cluster, meaning that it is not deleted in between UI upgrades. However, we've observed that when setting up a new cluster, the newly created cluster's accounting database is tightly coupled to the cluster itself by means of (1) database name and (2) table names.
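To illustrate the table-name coupling described above, here is a minimal sketch in Python. It assumes that per-cluster accounting tables carry the cluster name as a prefix (for example `oldcluster_job_table`) while shared tables such as `cluster_table` do not. The table listing and the helper `split_accounting_tables` are hypothetical and only serve the illustration; the actual naming in a given database should be checked before relying on this.

```python
def split_accounting_tables(table_names, cluster_name):
    """Split table names into per-cluster and shared groups.

    Assumption: per-cluster tables are named '<cluster_name>_<suffix>'.
    """
    prefix = f"{cluster_name}_"
    per_cluster = sorted(t for t in table_names if t.startswith(prefix))
    shared = sorted(t for t in table_names if not t.startswith(prefix))
    return per_cluster, shared


# Hypothetical listing, e.g. what SHOW TABLES might return for the old database.
tables = [
    "cluster_table",
    "user_table",
    "oldcluster_job_table",
    "oldcluster_step_table",
]

per_cluster, shared = split_accounting_tables(tables, "oldcluster")
print("tied to the cluster name:", per_cluster)
print("shared:", shared)
```

Under that assumption, every table in the first group would need to be renamed or remapped when the data is moved to a cluster with a different name, which is what makes a plain dump and restore awkward.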
The problem this gives for our cluster users is that when creating a new ParallelCluster under step 2 above, all accounting data is lost - invisible, if you will, because it's in a different database within the cluster.
We have looked into migrating with DMS, but because of the tight coupling between cluster and database (via table names), this proves to be quite difficult and potentially error-prone. Unfortunately, dumping the database and then inserting it into the new database instance will not work for us either, whether because of the tight coupling or because the new cluster cannot have the same name as the old cluster (and we cannot have downtime while upgrading).
Looking around in both the AWS docs and on the internet, I've not found much that points me in the right direction. However, many customers must be running into this when upgrading to new ParallelCluster versions. I'd thus welcome a suggestion as to how to handle this. Is there a way to have the accounting database running loosely coupled from a ParallelCluster, allowing multiple clusters to be supported within one database (as suggested by the cluster_table table)? Any other approach that works for many customers? We're so far using the service quite happily, but this seems to be a bit of a roadblock.
Thanks in advance!