Cloud-init fails for ubuntu 20.04 base AMI and Cloud-init version '23.3.1-0ubuntu1~20.04.1' #1333

Closed
supershal opened this issue Oct 25, 2023 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@supershal

What steps did you take and what happened:

The latest cloud-init version, 23.3.1-0ubuntu1~20.04.1, which ships with the base AMI for Ubuntu 20.04, is unable to run the boothook (https://cloudinit.readthedocs.io/en/latest/explanation/format.html#cloud-boothook) provided by CAPA: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/0bf78b04b305a77aec37a68c107102231faa7a16/pkg/cloud/services/secretsmanager/secret_fetch_script.go#L20
As a result, the CAPA VMs do not initialize as expected.

Steps to reproduce:

  1. Create an AMI using image-builder:
make build-ami-ubuntu-2004
  2. Create a CAPA cluster using the AMI created in step 1, following the instructions at: https://cluster-api-aws.sigs.k8s.io/getting-started.html

  3. Check the logs at /var/log/cloud-init-output.log

What did you expect to happen:
Cloud-init runs successfully on the VM.

Anything else you would like to add:
Log from cloud-init.

[2023-10-24 18:53:21] 2023-10-24 18:53:21,892 - util.py[WARNING]: failed stage init
[2023-10-24 18:53:21] failed run of stage init
[2023-10-24 18:53:21] ------------------------------------------------------------
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 78, in read_file_or_url
[2023-10-24 18:53:21]     with open(file_path, "rb") as fp:
[2023-10-24 18:53:21] FileNotFoundError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] The above exception was the direct cause of the following exception:
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 238, in _do_include
[2023-10-24 18:53:21]     resp = read_file_or_url(
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 84, in read_file_or_url
[2023-10-24 18:53:21]     raise UrlError(cause=e, code=code, headers=None, url=url) from e
[2023-10-24 18:53:21] cloudinit.url_helper.UrlError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] The above exception was the direct cause of the following exception:
[2023-10-24 18:53:21]
[2023-10-24 18:53:21] Traceback (most recent call last):
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 766, in status_wrapper
[2023-10-24 18:53:21]     ret = functor(name, args)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 453, in main_init
[2023-10-24 18:53:21]     init.update()
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 484, in update
[2023-10-24 18:53:21]     self._store_processeddata(self.datasource.get_userdata(), "userdata")
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 599, in get_userdata
[2023-10-24 18:53:21]     self.userdata = self.ud_proc.process(self.get_userdata_raw())
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 88, in process
[2023-10-24 18:53:21]     self._process_msg(convert_string(blob), accumulating_msg)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 159, in _process_msg
[2023-10-24 18:53:21]     self._do_include(payload, append_msg)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 264, in _do_include
[2023-10-24 18:53:21]     _handle_error(message, urle)
[2023-10-24 18:53:21]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 72, in _handle_error
[2023-10-24 18:53:21]     raise RuntimeError(error_message) from source_exception
[2023-10-24 18:53:21] RuntimeError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt' for url: file:///etc/secret-userdata.txt
[2023-10-24 18:53:21] ------------------------------------------------------------
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 running 'modules:config' at Tue, 24 Oct 2023 18:53:37 +0000. Up 42.69 seconds.
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 running 'modules:final' at Tue, 24 Oct 2023 18:53:40 +0000. Up 46.25 seconds.
[2023-10-24 18:53:40] Cloud-init v. 23.3.1-0ubuntu1~20.04.1 finished at Tue, 24 Oct 2023 18:53:40 +0000. Datasource DataSourceEc2Local.  Up 46.42 seconds
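
For anyone triaging a similar log, the failure can be spotted mechanically. The following is a minimal sketch, not part of cloud-init or image-builder: the function name `summarize_cloud_init_log` and both regular expressions are assumptions made for illustration, based only on the message formats visible in the log above.

```python
import re

# Patterns inferred from the log lines in this issue (assumptions, not a
# documented cloud-init log format).
FAILED_STAGE_RE = re.compile(r"failed run of stage (\S+)")
MISSING_FILE_RE = re.compile(r"No such file or directory: '([^']+)'")


def summarize_cloud_init_log(text):
    """Return (failed_stages, missing_files), each de-duplicated in log order."""
    failed_stages = []
    missing_files = []
    for line in text.splitlines():
        stage = FAILED_STAGE_RE.search(line)
        if stage and stage.group(1) not in failed_stages:
            failed_stages.append(stage.group(1))
        missing = MISSING_FILE_RE.search(line)
        if missing and missing.group(1) not in missing_files:
            missing_files.append(missing.group(1))
    return failed_stages, missing_files


# Three lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
[2023-10-24 18:53:21] failed run of stage init
[2023-10-24 18:53:21] FileNotFoundError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-24 18:53:21] RuntimeError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt' for url: file:///etc/secret-userdata.txt
"""

if __name__ == "__main__":
    print(summarize_cloud_init_log(SAMPLE_LOG))
```

On an affected instance, the input text would come from /var/log/cloud-init-output.log instead of the sample string.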

Environment:

Project (Image Builder for Cluster API):

Additional info for Image Builder for Cluster API related issues:

  • OS (e.g. from /etc/os-release, or cmd /c ver): ubuntu-20.04
  • Packer Version:
  • Packer Provider:
  • Ansible Version:
  • Cluster-api version (if using):
  • Kubernetes version: (use kubectl version):

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 25, 2023
@supershal
Author

We were able to downgrade cloud-init to 23.2.1-0ubuntu0~20.04.2 and create a cluster successfully. See mesosphere/konvoy-image-builder#938
cc: @voor @cnmcavoy

We are still not sure of the root cause, or of which change in cloud-init resulted in this issue.

@supershal
Author

supershal commented Oct 27, 2023

I was able to pass the following override file to image-builder and build an AMI that runs the CAPA cloud-init script successfully.
pin-cloud-init-override.json :

{
    "ansible_extra_vars": "pinned_debs=\"cloud-init=23.1.2-0ubuntu0~20.04.2\""
}

I built the image using the following Makefile target of image-builder:
make build-ami-ubuntu-2004 PACKER_VAR_FILES=pin-cloud-init-override.json
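
The override file above could also be generated rather than written by hand, for example when the pinned version changes between builds. This is only a sketch: `pin_override` and `render_override` are hypothetical helper names, while the `ansible_extra_vars` key and the `pinned_debs` variable are taken from the override file above.

```python
import json


def pin_override(package, version):
    """Build the override mapping that pins one deb package to one version."""
    # Same shape as pin-cloud-init-override.json in this comment.
    return {"ansible_extra_vars": f'pinned_debs="{package}={version}"'}


def render_override(package, version):
    """Serialize the override mapping as JSON text for a Packer var file."""
    return json.dumps(pin_override(package, version), indent=4) + "\n"


if __name__ == "__main__":
    # Writes the same content as the hand-written file shown above.
    with open("pin-cloud-init-override.json", "w") as fh:
        fh.write(render_override("cloud-init", "23.1.2-0ubuntu0~20.04.2"))
```

The resulting file is then passed through PACKER_VAR_FILES exactly as in the make command above.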

We now have to investigate which changes in 23.3.1-0ubuntu1~20.04.1 broke the CAPA cloud-init script.

@voor
Member

voor commented Oct 31, 2023

Moving over some comments from Slack so they're not lost in the sands of time:

- name: Downgrade cloud init.
  apt:
    deb: http://launchpadlibrarian.net/679992659/cloud-init_23.2.2-0ubuntu0~20.04.1_all.deb
    state: present
    force: true

- name: Pin cloud init to prevent version issues.
  dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop:
    - cloud-init

@dlipovetsky

For image-builder users who have hit this bug and are reading this issue:

We believe the root cause is in cloud-init, and would like to fix it there (see canonical/cloud-init#4572). We prefer this to the alternative, which is to "pin" an older, known-good cloud-init version in image-builder itself.

For now, if you use image-builder to create an Ubuntu 20.04 AMI, please use the workaround described in #1333 (comment).

@dlipovetsky

dlipovetsky commented Jan 17, 2024

This might be related to #406, which historically caused issues with CAPA.

@supershal and I found that the feature override mechanism used in #406 does not work in recent versions of cloud-init in Ubuntu 20.04. This mechanism was removed from cloud-init in canonical/cloud-init#4228.

Patching cloud-init is the officially documented mechanism now:

Currently used upstream values for feature flags are set in cloudinit/features.py. Overrides to these values should be patched directly (e.g., via quilt patch) by downstreams.

I guess modifying the cloud-init Python module to set ERROR_ON_USER_DATA_FAILURE = False is something image-builder can do for now. But once Ubuntu 20.04 is EOL, the feature flag itself will be removed.
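
One way such a modification might look, as a rough sketch rather than an agreed-upon fix: rewrite the flag's assignment in the text of cloudinit/features.py. The helper name `set_feature_flag` is hypothetical, the sample text is a stand-in for the real file (`OTHER_FLAG` is a placeholder), and the sketch assumes the flag is a plain single-line top-level assignment.

```python
import re


def set_feature_flag(source, flag, value):
    """Return `source` with the single top-level `flag = ...` line rewritten."""
    pattern = re.compile(rf"^{re.escape(flag)}\s*=.*$", re.MULTILINE)
    patched, count = pattern.subn(f"{flag} = {value!r}", source)
    if count != 1:
        # Refuse to guess if the file layout is not what this sketch assumes.
        raise ValueError(f"expected one assignment to {flag}, found {count}")
    return patched


# Stand-in for the contents of cloudinit/features.py (not the real file).
SAMPLE_FEATURES = "ERROR_ON_USER_DATA_FAILURE = True\nOTHER_FLAG = True\n"

if __name__ == "__main__":
    print(set_feature_flag(SAMPLE_FEATURES, "ERROR_ON_USER_DATA_FAILURE", False))
```

Judging by the paths in the traceback above, the installed module on the image would presumably be /usr/lib/python3/dist-packages/cloudinit/features.py, but that location is an inference and should be checked on the image. A package upgrade would also overwrite the patched file, so this would likely need to be combined with a hold on the cloud-init package.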

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2024
@k8s-triage-robot
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 16, 2024
@k8s-triage-robot
/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

