Make entrypoint script fail if any step fails, remove unnecessary command substitution #839

drakedevel · 2020-02-12T02:37:30Z

Issue #, if available: None

Description of changes:

The entrypoint script added in #735 is not run with set -e, so if any of the commands in it fails (such as installing the CNI), the script will silently succeed. This is a regression, since the previous code would catch this and error out.

I ran into this when I noticed some nodes on our cluster running 1.6rc6 were failing to start pods with CNI errors, despite having an apparently-healthy CNI pod running. Upon looking at the pod's logs, I saw the following:

2020-02-11T17:51:20.908196958+00:00 stdout P starting IPAM daemon in background ... 
2020-02-11T17:51:20.908344789+00:00 stdout F ok.
2020-02-11T17:51:20.908357137+00:00 stdout P checking for IPAM connectivity ... 
2020-02-11T17:51:40.933888193+00:00 stdout F ok.
2020-02-11T17:51:40.933888193+00:00 stdout P copying CNI plugin binaries and config files ... 
2020-02-11T17:51:41.056950001+00:00 stderr F cp: error writing '/host/opt/cni/bin/aws-cni': Cannot allocate memory
2020-02-11T17:51:41.056950001+00:00 stderr F cp: failed to extend '/host/opt/cni/bin/aws-cni': Cannot allocate memory
2020-02-11T17:51:41.064360344+00:00 stdout F  ok.
2020-02-11T17:51:41.064424913+00:00 stdout F foregrounding IPAM daemon ...

The issue turned out to be on our end: an overly-aggressive memory limit (which worked fine on rc4, but the shell script approach increases the peak memory required by quite a bit). But nevertheless, cp failed, and the script blindly charged on, when it should probably have crashed (which would have drawn my attention to it much more quickly).

The main change I made to the script was to add set -e, which will make the script fail if any command it invokes fails (if not in the context of an if or || or similar). While I was at it, I made two additional changes that aren't causing issues but are probably good ideas:

set -u, which will make the script abort on undefined variable access (helps with typos)
Two places in the code used $(...) (command substitution) when they don't intend to execute the output of the command inside. One instance was completely unnecessary, the other I replaced with { } to group the commands.

I'm happy to revert either or both of these extra changes, but they jumped out at me while I was making the first one so I figured I'd submit them.

I've tested this change on our development cluster and it doesn't seem to have any issues. Additionally, I simulated an issue where it couldn't write to the aws-cni binary (by creating a bogus symlink), and confirmed that the pod failed to start with the following log:

starting IPAM daemon in background ... ok.
checking for IPAM connectivity ... ok.
copying CNI plugin binaries and config files ... cp: not writing through dangling symlink '/host/opt/cni/bin/aws-cni'

which is the expected outcome with this change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mogren · 2020-02-12T06:29:16Z

@drakedevel Thanks for testing this! I've compared rc4 and rc6, and the main difference is that the aws-cni is 11 MB in rc4, but has ballooned to 26 MB in rc6. I'll take a look at what has caused this increase. Since rc5 is even bigger at 27 MB, I wonder if it was the adding the k8s.io/cri-api that increased that size that much.

jaypipes

All good changes, thanks @drakedevel and thanks for the excellent commit message! :)

drakedevel · 2020-02-12T21:17:17Z

@jaypipes thanks for the quick review!

Would it be helpful for me to submit a PR for a cherry-pick to master, or is that handled another way?

jaypipes · 2020-02-12T21:55:21Z

Gah! I didn't realize this was proposed for merging directly into a release branch... Generally we try to only accept PRs into the master branch and then cherry-pick into the release branch.

No worries, though. Yes, please propose the same PR to master

drakedevel added 2 commits February 11, 2020 15:02

Make entrypoint shell script fail if any step fails.

02ad882

Remove unnecessary command substitutions from entrypoint script.

078ec9e

mogren requested a review from jaypipes February 12, 2020 04:16

jaypipes approved these changes Feb 12, 2020

View reviewed changes

jaypipes merged commit 44ee4d6 into aws:release-1.6 Feb 12, 2020

drakedevel deleted the entrypoint-improvements branch February 12, 2020 21:17

drakedevel mentioned this pull request Feb 12, 2020

Cherry-pick entrypoint script improvements from PR #839 #843

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make entrypoint script fail if any step fails, remove unnecessary command substitution #839

Make entrypoint script fail if any step fails, remove unnecessary command substitution #839

drakedevel commented Feb 12, 2020

mogren commented Feb 12, 2020 •

edited

Loading

jaypipes left a comment

drakedevel commented Feb 12, 2020

jaypipes commented Feb 12, 2020

Make entrypoint script fail if any step fails, remove unnecessary command substitution #839

Make entrypoint script fail if any step fails, remove unnecessary command substitution #839

Conversation

drakedevel commented Feb 12, 2020

mogren commented Feb 12, 2020 • edited Loading

jaypipes left a comment

Choose a reason for hiding this comment

drakedevel commented Feb 12, 2020

jaypipes commented Feb 12, 2020

mogren commented Feb 12, 2020 •

edited

Loading