Codespell: config, workflow + fixed typos #24

Closed
wants to merge 3 commits
3 changes: 3 additions & 0 deletions .codespellrc
@@ -0,0 +1,3 @@
[codespell]
skip = .git,*.pdf,*.svg
# ignore-words-list =
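
For reference, the same check can be run locally against this config. A minimal sketch, assuming codespell is installed from PyPI and is a 2.x release (which auto-discovers `.codespellrc` in the working directory):

```sh
# From the repository root; codespell picks up the [codespell] section
# of .codespellrc, so no extra flags are needed.
pip install codespell
codespell
```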
19 changes: 19 additions & 0 deletions .github/workflows/codespell.yml
@@ -0,0 +1,19 @@
---
name: Codespell

on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

jobs:
  codespell:
    name: Check for spelling errors
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Codespell
        uses: codespell-project/actions-codespell@v1
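
The workflow checks out the repository and runs the codespell action with its default inputs on pushes and pull requests to `master`. A rough local equivalent of the CI check, passing the skip list from `.codespellrc` explicitly (a sketch; the action's exact defaults are not shown in this diff):

```sh
codespell --skip=".git,*.pdf,*.svg"
```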
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -206,7 +206,7 @@ All notable changes to this project will be documented in this file.
- Fix potential race condition in loading BQ job data.
- Remove deployment manager support.
- Update Nvidia to 470.82.01 and CUDA to 11.4.4
- Reenable gcsfuse in ansible and workaround the repo gpg check problem
- Re-enable gcsfuse in ansible and workaround the repo gpg check problem

## \[4.1.5\]

@@ -257,7 +257,7 @@ All notable changes to this project will be documented in this file.

## \[4.0.4\]

- Configure sockets, cores, threads on compute nodes for better performace with
- Configure sockets, cores, threads on compute nodes for better performance with
`cons_tres`.

## \[4.0.3\]
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -22,7 +22,7 @@ If you make an automated change (changing a function name, fixing a pervasive
spelling mistake), please send the command/regex used to generate the changes
along with the patch, or note it in the commit message.

While not required, we encourage use of `git format-patch` to geneate the patch.
While not required, we encourage use of `git format-patch` to generate the patch.
This ensures the relevant author line and commit message stay attached. Plain
`diff`'d output is also okay. In either case, please attach them to the bug for
us to review. Spelling corrections or documentation improvements can be
2 changes: 1 addition & 1 deletion README.md
@@ -43,7 +43,7 @@ to help you get up and running and stay running.
Issues and/or enhancement requests can be submitted to
[SchedMD's Bugzilla](https://bugs.schedmd.com).

Also, join comunity discussions on either the
Also, join community discussions on either the
[Slurm User mailing list](https://slurm.schedmd.com/mail.html) or the
[Google Cloud & Slurm Community Discussion Group](https://groups.google.com/forum/#!forum/google-cloud-slurm-discuss).

2 changes: 1 addition & 1 deletion ansible/playbook.yml
@@ -47,7 +47,7 @@
msg: >
OS ansible_distribution version ansible_distribution_major_version is not
supported.
Please use a suported OS in list:
Please use a supported OS in list:
- RHEL 7,8
- CentOS 7,8
- Debian 10
2 changes: 1 addition & 1 deletion ansible/roles/lustre/tasks/main.yml
@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

- name: Install Dependancies
- name: Install Dependencies
package:
name:
- wget
2 changes: 1 addition & 1 deletion docs/cloud.md
@@ -28,7 +28,7 @@ There are two deployment methods for cloud cluster management:

This deployment method leverages
[GCP Marketplace](./glossary.md#gcp-marketplace) to make setting up clusters a
breeze without leaving your browser. While this method is simplier and less
breeze without leaving your browser. While this method is simpler and less
flexible, it is great for exploring what `slurm-gcp` is!

See the [Marketplace Guide](./marketplace.md) for setup instructions and more
8 changes: 4 additions & 4 deletions docs/faq.md
@@ -90,7 +90,7 @@ extra_logging_flags:

### How do I move data for a job?

Data can be migrated to and from external sources using a worflow of dependant
Data can be migrated to and from external sources using a workflow of dependent
jobs. A [workflow submission script](../jobs/submit_workflow.py.py) and
[helper jobs](../jobs/data_migrate/) are provided. See
[README](../jobs/README.md) for more information.
@@ -205,8 +205,8 @@ it may be allocated jobs again.
### How do I limit user access to only using login nodes?

By default, all instances are configured with
[OS Login](./glossary.md#os-login). This keeps UID and GID of users consistant
accross all instances and allows easy user control with
[OS Login](./glossary.md#os-login). This keeps UID and GID of users consistent
across all instances and allows easy user control with
[IAM Roles](./glossary.md#iam-roles).

1. Create a group for all users in `admin.google.com`.
@@ -229,7 +229,7 @@ accross all instances and allows easy user control with
1. Select boxes for login nodes
1. Add group as a member with the **IAP-secured Tunnel User** role. Please see
[Enabling IAP for Compute Engine](https://cloud.google.com/iap/docs/enabling-compute-howto)
for mor information.
for more information.

### What Slurm image do I use for production?

2 changes: 1 addition & 1 deletion docs/federation.md
@@ -115,7 +115,7 @@ please refer to [multiple-slurmdbd](#multiple-slurmdbd) section.

### Additional Requirements

- User UID and GID are consistant accross all federated clusters.
- User UID and GID are consistent across all federated clusters.

## Multiple Slurmdbd

8 changes: 4 additions & 4 deletions docs/hybrid.md
@@ -26,7 +26,7 @@ This guide focuses on setting up a hybrid [Slurm cluster](./glossary.md#slurm).
With hybrid, there are different challenges and considerations that need to be
taken into account. This guide will cover them and their recommended solutions.

There is a clear seperation of how on-prem and cloud resources are managed
There is a clear separation of how on-prem and cloud resources are managed
within your hybrid cluster. This means that you can modify either side of the
hybrid cluster without disrupting the other side! You manage your on-prem and
our [Slurm cluster module](../terraform/slurm_cluster/README.md) will manage the
@@ -71,7 +71,7 @@ and terminating nodes in the cloud:
- Creates compute node resources based upon Slurm job allocation and
configured compute resources.
- `slurmsync.py`
- Synchronizes the Slurm state and the GCP state, reducing discrepencies from
- Synchronizes the Slurm state and the GCP state, reducing discrepancies from
manual admin activity or other edge cases.
- May update Slurm node states, create or destroy GCP compute resources or
other script managed GCP resources.
@@ -260,7 +260,7 @@ controller to be able to burst into the cloud.

### Manage Secrets

Additionally, [MUNGE](./glossary.md#munge) secrets must be consistant across the
Additionally, [MUNGE](./glossary.md#munge) secrets must be consistent across the
cluster. There are a few safe ways to deal with munge.key distribution:

- Use NFS to mount `/etc/munge` from the controller (default behavior).
@@ -277,7 +277,7 @@ connections to the munge NFS is critical.

- Isolate the cloud compute nodes of the cluster into their own project, VPC,
and subnetworks. Use project or network peering to enable access to other
cloud infrastructure in a controlled mannor.
cloud infrastructure in a controlled manner.
- Setup firewall rules to control ingress and egress to the controller such that
only trusted machines or networks use its NFS.
- Only allow trusted private address (ranges) for communication to the
2 changes: 1 addition & 1 deletion jobs/README.md
@@ -18,7 +18,7 @@ $ sbatch --export=MIGRATE_INPUT=/tmp/seq.txt,MIGRATE_OUTPUT=/tmp/shuffle.txt \
## submit_workflow.py

This script is a runner that submits a sequence of 3 jobs as defined in the
input structured yaml file. The three jobs submitted can be refered to as:
input structured yaml file. The three jobs submitted can be referred to as:
`stage_in`; `main`; and `stage_out`. `stage_in` should move data for `main` to
consume. `main` is the main script that may consume and generate data.
`stage_out` should move data generated from `main` to an external location.
2 changes: 1 addition & 1 deletion scripts/resume.py
@@ -173,7 +173,7 @@ def create_instances_request(nodes, placement_group, exclusive_job=None):
body.sourceInstanceTemplate = template

labels = dict(slurm_job_id=exclusive_job) if exclusive_job is not None else None
# overwrites properties accross all instances
# overwrites properties across all instances
body.instanceProperties = instance_properties(
partition, model, placement_group, labels
)
2 changes: 1 addition & 1 deletion terraform/slurm_cluster/README.md
@@ -38,7 +38,7 @@ use.
Partitions define what compute resources are available to the controller so it
may allocate jobs. Slurm will resume/create compute instances as needed to run
allocated jobs and will suspend/terminate the instances after they are no longer
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistant;
needed (e.g. IDLE for SuspendTimeout duration). Static nodes are persistent;
they are exempt from being suspended/terminated under normal conditions. Dynamic
nodes are burstable; they will scale up and down with workload.

@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md).
It is compatible with:

@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_partition](../../../modules/slurm_partition/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_controller_instance](../../../modules/slurm_controller_instance/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[slurm_instance_template](../../../modules/slurm_instance_template/README.md)
intended to be used by the
[slurm_login_instance](../../../modules/slurm_login_instance/README.md).
@@ -14,7 +14,7 @@

## Overview

This exmaple creates a
This example creates a
[Slurm partition](../../../modules/slurm_partition/README.md).

## Usage
@@ -86,7 +86,7 @@ limitations under the License.
| <a name="input_enable_devel"></a> [enable\_devel](#input\_enable\_devel) | Enables development mode. Not for production use. | `bool` | `false` | no |
| <a name="input_enable_reconfigure"></a> [enable\_reconfigure](#input\_enable\_reconfigure) | Enables automatic Slurm reconfigure on when Slurm configuration changes (e.g.<br>slurm.conf.tpl, partition details). Compute instances and resource policies<br>(e.g. placement groups) will be destroyed to align with new configuration.<br><br>NOTE: Requires Python and Google Pub/Sub API.<br><br>*WARNING*: Toggling this will impact the running workload. Deployed compute nodes<br>will be destroyed and their jobs will be requeued. | `bool` | `false` | no |
| <a name="input_epilog_scripts"></a> [epilog\_scripts](#input\_epilog\_scripts) | List of scripts to be used for Epilog. Programs for the slurmd to execute<br>on every node when a user's job completes.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog. | <pre>list(object({<br> filename = string<br> content = string<br> }))</pre> | `[]` | no |
| <a name="input_google_app_cred_path"></a> [google\_app\_cred\_path](#input\_google\_app\_cred\_path) | Path to Google Applicaiton Credentials. | `string` | `null` | no |
| <a name="input_google_app_cred_path"></a> [google\_app\_cred\_path](#input\_google\_app\_cred\_path) | Path to Google Application Credentials. | `string` | `null` | no |
| <a name="input_install_dir"></a> [install\_dir](#input\_install\_dir) | Directory where the hybrid configuration directory will be installed on the<br>on-premise controller (e.g. /etc/slurm/hybrid). This updates the prefix path<br>for the resume and suspend scripts in the generated `cloud.conf` file.<br><br>This variable should be used when the TerraformHost and the SlurmctldHost<br>are different.<br><br>This will default to var.output\_dir if null. | `string` | `null` | no |
| <a name="input_login_network_storage"></a> [login\_network\_storage](#input\_login\_network\_storage) | Storage to mounted on login and controller instances<br>* server\_ip : Address of the storage server.<br>* remote\_mount : The location in the remote instance filesystem to mount from.<br>* local\_mount : The location on the instance filesystem to mount to.<br>* fs\_type : Filesystem type (e.g. "nfs").<br>* mount\_options : Options to mount with. | <pre>list(object({<br> server_ip = string<br> remote_mount = string<br> local_mount = string<br> fs_type = string<br> mount_options = string<br> }))</pre> | `[]` | no |
| <a name="input_munge_mount"></a> [munge\_mount](#input\_munge\_mount) | Remote munge mount for compute and login nodes to acquire the munge.key.<br><br>By default, the munge mount server will be assumed to be the<br>`var.slurm_control_host` (or `var.slurm_control_addr` if non-null) when<br>`server_ip=null`. | <pre>object({<br> server_ip = string<br> remote_mount = string<br> fs_type = string<br> mount_options = string<br> })</pre> | <pre>{<br> "fs_type": "nfs",<br> "mount_options": "",<br> "remote_mount": "/etc/munge/",<br> "server_ip": null<br>}</pre> | no |
@@ -95,7 +95,7 @@
| <a name="input_partitions"></a> [partitions](#input\_partitions) | Cluster partitions as a list. | <pre>list(object({<br> compute_list = list(string)<br> partition = object({<br> enable_job_exclusive = bool<br> enable_placement_groups = bool<br> network_storage = list(object({<br> server_ip = string<br> remote_mount = string<br> local_mount = string<br> fs_type = string<br> mount_options = string<br> }))<br> partition_conf = map(string)<br> partition_name = string<br> partition_nodes = map(object({<br> node_count_dynamic_max = number<br> node_count_static = number<br> access_config = list(object({<br> network_tier = string<br> }))<br> bandwidth_tier = string<br> enable_spot_vm = bool<br> group_name = string<br> instance_template = string<br> node_conf = map(string)<br> spot_instance_config = object({<br> termination_action = string<br> })<br> }))<br> partition_startup_scripts_timeout = number<br> subnetwork = string<br> zone_target_shape = string<br> zone_policy_allow = list(string)<br> zone_policy_deny = list(string)<br> })<br> }))</pre> | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | Project ID to create resources in. | `string` | n/a | yes |
| <a name="input_prolog_scripts"></a> [prolog\_scripts](#input\_prolog\_scripts) | List of scripts to be used for Prolog. Programs for the slurmd to execute<br>whenever it is asked to run a job step from a new job allocation.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog. | <pre>list(object({<br> filename = string<br> content = string<br> }))</pre> | `[]` | no |
| <a name="input_slurm_bin_dir"></a> [slurm\_bin\_dir](#input\_slurm\_bin\_dir) | Path to directroy of Slurm binary commands (e.g. scontrol, sinfo). If 'null',<br>then it will be assumed that binaries are in $PATH. | `string` | `null` | no |
| <a name="input_slurm_bin_dir"></a> [slurm\_bin\_dir](#input\_slurm\_bin\_dir) | Path to directory of Slurm binary commands (e.g. scontrol, sinfo). If 'null',<br>then it will be assumed that binaries are in $PATH. | `string` | `null` | no |
| <a name="input_slurm_cluster_name"></a> [slurm\_cluster\_name](#input\_slurm\_cluster\_name) | Cluster name, used for resource naming and slurm accounting. | `string` | n/a | yes |
| <a name="input_slurm_control_addr"></a> [slurm\_control\_addr](#input\_slurm\_control\_addr) | The IP address or a name by which the address can be identified.<br><br>This value is passed to slurm.conf such that:<br>SlurmctldHost={var.slurm\_control\_host}\({var.slurm\_control\_addr}\)<br><br>See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost | `string` | `null` | no |
| <a name="input_slurm_control_host"></a> [slurm\_control\_host](#input\_slurm\_control\_host) | The short, or long, hostname of the machine where Slurm control daemon is<br>executed (i.e. the name returned by the command "hostname -s").<br><br>This value is passed to slurm.conf such that:<br>SlurmctldHost={var.slurm\_control\_host}\({var.slurm\_control\_addr}\)<br><br>See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost | `string` | n/a | yes |
@@ -245,14 +245,14 @@ variable "partitions" {

variable "google_app_cred_path" {
type = string
description = "Path to Google Applicaiton Credentials."
description = "Path to Google Application Credentials."
default = null
}

variable "slurm_bin_dir" {
type = string
description = <<EOD
Path to directroy of Slurm binary commands (e.g. scontrol, sinfo). If 'null',
Path to directory of Slurm binary commands (e.g. scontrol, sinfo). If 'null',
then it will be assumed that binaries are in $PATH.
EOD
default = null
@@ -26,7 +26,7 @@ It is recommended to pass in an
[instance template](../../../../docs/glossary.md#instance-template) generated by
the [slurm_instance_template](../slurm_instance_template/README.md) module.

The controller is responisble for managing compute instances defined by multiple
The controller is responsible for managing compute instances defined by multiple
[slurm_partition](../slurm_partition/README.md).

The controller instance run [slurmctld](../../../../docs/glossary.md#slurmctld),
4 changes: 2 additions & 2 deletions terraform/slurm_cluster/modules/slurm_partition/README.md
@@ -25,7 +25,7 @@ creates a Slurm partition for
Conceptutally, a Slurm partition is a queue that is associated with compute
resources, limits, and access controls. Users submit jobs to one or more
partitions to have their jobs be completed against requested resources within
their alloted limits and access.
their allotted limits and access.

This module defines a partition and its resources -- most notably, compute
nodes. Sets of compute nodes reside within a partition. Each set of compute
@@ -34,7 +34,7 @@ nodes must resolve to an
[instance template](../../../../docs/glossary.md#instance-template) is: created
by definition -- module creates an
[instance template](../../../../docs/glossary.md#instance-template) using subset
of input paramters; or by the
of input parameters; or by the
[self link](../../../../docs/glossary.md#self-link) of an
[instance template](../../../../docs/glossary.md#instance-template) that is
managed outside of this module. Additionally, there are compute node parameters