cloud-init >= 20.3 fails on missing /etc/secret-userdata.txt #2757

Closed
dilyevsky opened this issue Sep 10, 2021 · 4 comments
Labels
kind/bug, needs-priority

Comments

@dilyevsky
Contributor

/kind bug

What steps did you take and what happened:

  1. Upgrade cloud-init to 20.3+
  2. Attempt to provision node using CAPA control plane
  3. The node becomes stuck in the Provisioning state; the cloud-init init phase fails with the following error while executing user-data:
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 377, in main_init
    init.update()
  File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 363, in update
    self._store_userdata()
  File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 390, in _store_userdata
    processed_ud = self.datasource.get_userdata()
  File "/usr/lib/python3.9/site-packages/cloudinit/sources/__init__.py", line 385, in get_userdata
    self.userdata = self.ud_proc.process(self.get_userdata_raw())
  File "/usr/lib/python3.9/site-packages/cloudinit/user_data.py", line 90, in process
    self._process_msg(convert_string(blob), accumulating_msg)
  File "/usr/lib/python3.9/site-packages/cloudinit/user_data.py", line 160, in _process_msg
    self._do_include(payload, append_msg)
  File "/usr/lib/python3.9/site-packages/cloudinit/user_data.py", line 258, in _do_include
    _handle_error(message, urle)
  File "/usr/lib/python3.9/site-packages/cloudinit/user_data.py", line 74, in _handle_error
    raise Exception(error_message) from source_exception
Exception: [Errno 2] No such file or directory: '/etc/secret-userdata.txt' for url: file:///etc/secret-userdata.txt

What did you expect to happen:

User-data execution stage should not fail.

Anything else you would like to add:

This appears to happen because cloud-init now treats failures while processing user-data as hard failures (https://cloudinit.readthedocs.io/en/latest/topics/hacking.html):

cloudinit.features.ERROR_ON_USER_DATA_FAILURE = True
If there is a failure in obtaining user data (i.e., #include or decompress fails) and ERROR_ON_USER_DATA_FAILURE is False, cloud-init will log a warning and proceed. If it is True, cloud-init will instead raise an exception.

As of 20.3, ERROR_ON_USER_DATA_FAILURE is True.

(This flag can be removed after Focal is no longer supported.)
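To illustrate the behavior the docs describe, here is a simplified sketch. It is not cloud-init's actual code: `handle_userdata_error`, `include_file`, and the `hard_fail` parameter are hypothetical names, and the module-level constant only stands in for `cloudinit.features.ERROR_ON_USER_DATA_FAILURE`.

```python
import logging

LOG = logging.getLogger("userdata-sketch")

# Stand-in for cloudinit.features.ERROR_ON_USER_DATA_FAILURE (True as of 20.3).
ERROR_ON_USER_DATA_FAILURE = True


def handle_userdata_error(message, source=None, hard_fail=None):
    """Raise when hard_fail is true; otherwise log a warning and continue."""
    if hard_fail is None:
        hard_fail = ERROR_ON_USER_DATA_FAILURE
    if hard_fail:
        raise RuntimeError(message) from source
    LOG.warning(message)


def include_file(path, hard_fail=None):
    """Read an included file; return "" if it is unreadable and failures are soft."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except OSError as err:
        handle_userdata_error(f"{err} for url: file://{path}", err, hard_fail)
        return ""
```

With `hard_fail=False` a missing include yields an empty string and a warning, which corresponds to the pre-20.3 behavior; with `hard_fail=True` the same call raises, which is what the traceback above shows.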

Previously this was only a warning, as the CAPA documentation indicates.

I am not entirely sure why the CAPA user-data script ships /etc/secret-userdata.txt as x-include-url in addition to extracting it from Secrets Manager. One workaround that comes to mind is to ship images with an empty /etc/secret-userdata.txt and to test for an empty file in addition to a missing file in

.
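A minimal sketch of that workaround, assuming it runs as a step during image build. The helper names `ensure_placeholder` and `is_missing_or_empty` are hypothetical, and whether cloud-init and the CAPA script accept an empty include file is not confirmed here.

```python
from pathlib import Path

# Path taken from the error message above.
SECRET_USERDATA = "/etc/secret-userdata.txt"


def ensure_placeholder(path=SECRET_USERDATA):
    """Create an empty file at path if none exists; return True if created."""
    target = Path(path)
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch(mode=0o600)  # empty placeholder, readable by owner only
    return True


def is_missing_or_empty(path=SECRET_USERDATA):
    """Return True if the file is absent or has zero length.

    This is the broadened check: a script that previously tested only
    for a missing file would also treat the empty placeholder as
    "secret not fetched yet".
    """
    target = Path(path)
    return not target.is_file() or target.stat().st_size == 0
```

The placeholder keeps the include from failing on a missing file, while the broadened check lets the script still detect that the real secret has not been written.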

Another option would be to tweak cloud-init in the node image to disable the hard failure, but that appears more difficult: based on my reading of the cloud-init docs, it can't be done at runtime.
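A rough sketch of that image-level tweak, assuming the installed cloud-init picks up a `feature_overrides.py` placed next to `features.py` (the downstream override mechanism as I understand the docs; verify against your cloud-init version). `write_feature_override` is a hypothetical helper, and the package directory must be writable, so this would run at image build time rather than on a booted node.

```python
from pathlib import Path

# Contents of the override module; the flag name comes from the cloud-init docs.
OVERRIDE_BODY = "ERROR_ON_USER_DATA_FAILURE = False\n"


def write_feature_override(package_dir):
    """Write feature_overrides.py into the cloudinit package directory.

    package_dir is the directory that contains features.py, e.g. the
    site-packages/cloudinit directory seen in the traceback.
    """
    pkg = Path(package_dir)
    if not (pkg / "features.py").is_file():
        raise FileNotFoundError(f"no features.py in {pkg}")
    override = pkg / "feature_overrides.py"
    override.write_text(OVERRIDE_BODY, encoding="utf-8")
    return override
```

For the environment in this report the call would be `write_feature_override("/usr/lib/python3.9/site-packages/cloudinit")`.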

Happy to send a patch if you agree on the approach.

Environment:

Any image with cloud-init version 20.3+ on latest CAPA control plane

  • Cluster-api-provider-aws version: 0.6.8, 0.7.0
  • Kubernetes version: (use kubectl version): n/a
  • OS (e.g. from /etc/os-release):
NAME=Fedora
VERSION="34.20210821.3.0 (CoreOS)"
ID=fedora
VERSION_ID=34
VERSION_CODENAME=""
PLATFORM_ID="platform:f34"
PRETTY_NAME="Fedora CoreOS 34.20210821.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:34"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=34
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=34
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='34.20210821.3.0'
DEFAULT_HOSTNAME=localhost

Bonus - cloud-init version:

$ cloud-init --version
/usr/bin/cloud-init 20.4
@k8s-ci-robot added the kind/bug and needs-priority labels on Sep 10, 2021
@randomvariable
Member

Are you not using image builder?

The images were fixed in

kubernetes-sigs/image-builder#406

@dilyevsky
Contributor Author

Nope, custom image. Hm, thanks for the pointer; let me try that approach.

@dilyevsky
Contributor Author

ugh this is actually hard to do on an image with a read-only /usr partition (CoreOS flavor) - need to make your own cloud-init package 🤦 ...

@randomvariable
Member

You can turn off this behaviour if you're unable to get cloud-init to work appropriately. See https://cluster-api-aws.sigs.k8s.io/topics/userdata-privacy.html#how-cluster-api-secures-tls-secrets and note the caveats.

I don't think there's anything more we can do here though.
