Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: make plugin health_timeout configurable in csi_plugin stanza #13340

Merged
merged 3 commits into from
Jun 14, 2022

Conversation

ggriffiths
Copy link
Contributor

Fixes #13179

Test results: Configured health_timeout as 5m and introduced an artificial error to make sure Portworx was killed after 5m:

Recent Events:
Time                  Type                     Description
2022-06-10T20:05:40Z  Killed                   Task successfully killed
2022-06-10T20:05:40Z  Terminated               Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
2022-06-10T20:05:40Z  Killing                  CSI plugin did not become healthy before configured 5m0s health timeout
2022-06-10T20:05:40Z  Plugin became unhealthy  Error: CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /etc/nomad.d/client/csi/plugins/62bdddd8-3955-92d2-cc23-02b6f6204e62/csi.sock: no such file or directory
2022-06-10T19:58:15Z  Started                  Task started by client
2022-06-10T19:58:13Z  Task Setup               Building Task Directory
2022-06-10T19:58:13Z  Received                 Task received by client

Signed-off-by: Grant Griffiths [email protected]

nomad/structs/structs.go Outdated Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @ggriffiths!

There's a jobspec checklist which I probably should have pointed out in the original issue, but it looks like you've already hit almost everything needed there except for the implementation for structs/diff.go.

We'll also want this new field added to the CSI plugin parameters documentation.

Thanks for the PR!

nomad/structs/structs.go Outdated Show resolved Hide resolved
@ggriffiths
Copy link
Contributor Author

There's a jobspec checklist which I probably should have pointed out in the original issue, but it looks like you've already hit almost everything needed there except for the implementation for structs/diff.go.

@tgross, thanks for the pointer! For structs/diff.go, should I add a new function for the CSIPluginConfig? I'm not seeing any existing logic in diff.go for this struct.

@tgross
Copy link
Member

tgross commented Jun 14, 2022

@tgross, thanks for the pointer! For structs/diff.go, should I add a new function for the CSIPluginConfig? I'm not seeing any existing logic in diff.go for this struct.

Oh. You've just discovered a bug in job plan output. 😊 We detect that the csi_plugin block has changed but aren't producing a diff of what fields have changed. This failing test shows the problem:

failing test
diff --git a/nomad/structs/diff_test.go b/nomad/structs/diff_test.go
index 1a5751a8c..153c5bb94 100644
--- a/nomad/structs/diff_test.go
+++ b/nomad/structs/diff_test.go
@@ -7377,6 +7377,40 @@ func TestTaskDiff(t *testing.T) {
                                },
                        },
                },
+               {
+                       Name: "CSIPlugin edited",
+                       Old: &Task{
+                               CSIPluginConfig: &TaskCSIPluginConfig{
+                                       ID:       "id0",
+                                       Type:     "monolith",
+                                       MountDir: "/csi",
+                               },
+                       },
+                       New: &Task{
+                               CSIPluginConfig: &TaskCSIPluginConfig{
+                                       ID:       "id0",
+                                       Type:     "monolith",
+                                       MountDir: "/whatever",
+                               },
+                       },
+                       Expected: &TaskDiff{
+                               Type: DiffTypeEdited,
+                               Objects: []*ObjectDiff{
+                                       {
+                                               Type: DiffTypeEdited,
+                                               Name: "CSIPluginConfig",
+                                               Fields: []*FieldDiff{
+                                                       {
+                                                               Type: DiffTypeEdited,
+                                                               Name: "MountDir",
+                                                               Old:  "/csi",
+                                                               New:  "/whatever",
+                                                       },
+                                               },
+                                       },
+                               },
+                       },
+               },
        }

        for _, c := range cases {

Because that's a bug and this PR is for a feature improvement, we'll need to handle those as separate PRs so we can backport the plan fix. So don't worry about diff for now and I'll come back for that later once this has been merged.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This will ship in the upcoming Nomad 1.3.2

tgross added a commit that referenced this pull request Jun 14, 2022
The changelog entry for #13340 indicated it was an improvement. But on
discussion, it was determined that this was a workaround for a
regression. Update the changelog to make this clear.
tgross pushed a commit that referenced this pull request Jun 14, 2022
tgross pushed a commit that referenced this pull request Jun 14, 2022
tgross added a commit that referenced this pull request Jun 14, 2022
The changelog entry for #13340 indicated it was an improvement. But on
discussion, it was determined that this was a workaround for a
regression. Update the changelog to make this clear.
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line backport/1.3.x backport to 1.3.x release line type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSI: make readiness probe timeout configurable to support faulty plugins
2 participants