tee: /usr/local/bin/k0s: Text file busy #357

Open
twz123 opened this issue Mar 25, 2022 · 8 comments · Fixed by #366
Labels
bug Something isn't working

Comments

@twz123
Member

twz123 commented Mar 25, 2022

Upgrading a cluster from one node to three nodes failed with the following log line:

level=fatal msg="upload failed: Process exited with status 1 (tee: /usr/local/bin/k0s: Text file busy\n)"

Target OS: Alpine 3.15
k0sctl version: 0.13.0-rc.1-1-gaf2f60b (af2f60b)
k0sctl.log

A second run of k0sctl also fails because it tries to join new controllers by requesting a token from the wrong node (the newly created one which hasn't been joined):

time="25 Mar 22 10:29 CET" level=debug msg="[ssh] 10.83.134.16:22: executing sudo -s /usr/local/bin/k0s token create --role controller --expiry 10m0s"

Logs from the second run: k0sctl_2.log

Config used:

"apiVersion": "k0sctl.k0sproject.io/v1beta1"
"kind": "Cluster"
"metadata":
  "name": "k0s-cluster"
"spec":
  "hosts":
  - "files":
    - "dstDir": "/var/lib/k0s/images/"
      "name": "bundle-file"
      "perm": "0755"
      "src": "airgap-images.tar"
    "role": "controller+worker"
    "ssh":
      "address": "10.83.134.28"
      "keyPath": "id_rsa"
      "port": 22
      "user": "k0s"
    "uploadBinary": true
  - "files":
    - "dstDir": "/var/lib/k0s/images/"
      "name": "bundle-file"
      "perm": "0755"
      "src": "airgap-images.tar"
    "role": "controller+worker"
    "ssh":
      "address": "10.83.134.16"
      "keyPath": "id_rsa"
      "port": 22
      "user": "k0s"
    "uploadBinary": true
  - "files":
    - "dstDir": "/var/lib/k0s/images/"
      "name": "bundle-file"
      "perm": "0755"
      "src": "airgap-images.tar"
    "role": "controller+worker"
    "ssh":
      "address": "10.83.134.66"
      "keyPath": "id_rsa"
      "port": 22
      "user": "k0s"
    "uploadBinary": true
  "k0s":
    "config":
      "spec":
        "telemetry":
          "enabled": false
    "version": "v1.23.3+k0s.1"

kke added the bug label on Mar 25, 2022

@twz123
Member Author

twz123 commented Mar 25, 2022

I was able to reproduce it (the "tee" error) without the upscale, i.e. running k0sctl again on a single node that has already been provisioned via a prior run of k0sctl.

I noticed that k0sctl wants to upgrade even if the target host is already running the correct version.

WARN [ssh] 10.83.134.135:22: k0s will be upgraded

@kke
Contributor

kke commented Mar 25, 2022

> I noticed that k0sctl wants to upgrade even if the target host is already running the correct version.

Yes, this always happens when k0sBinaryPath or files: is used because k0sctl didn't know if the file was changed. Now that k0sctl can detect local vs remote file changes, it should probably take this into consideration when deciding if the upgrade workflow should be chosen or not.
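
One way such a decision could look is a digest comparison between the local file and whatever is already on the host. The following is only a sketch under assumptions: `fileSHA256` and `needsUpload` are hypothetical names, not k0sctl functions, and how the remote digest is obtained (for example by running a checksum tool over SSH) is left out entirely.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileSHA256 returns the hex-encoded SHA-256 digest of the file at path.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// needsUpload is a hypothetical helper: it reports whether the local file
// should be uploaded, given the digest of the remote copy. An empty
// remoteSum is taken to mean "no remote file exists yet".
func needsUpload(localSum, remoteSum string) bool {
	return remoteSum == "" || localSum != remoteSum
}

func main() {
	// Create a throwaway file standing in for the local k0s binary.
	f, err := os.CreateTemp("", "k0s-demo-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString("pretend this is the k0s binary"); err != nil {
		fmt.Println("error:", err)
		return
	}
	f.Close()

	localSum, err := fileSHA256(f.Name())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("remote missing, upload needed:", needsUpload(localSum, ""))
	fmt.Println("same digest, upload needed:", needsUpload(localSum, localSum))
}
```

With a check like this, an unchanged binary would not have to force the upgrade workflow on its own.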

@kke
Contributor

kke commented Mar 25, 2022

> A second run of k0sctl also fails because it tries to join new controllers by requesting a token from the wrong node (the newly created one which hasn't been joined)

I wonder how this happens. The K0sLeader() should always pick a host that has k0s running.
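
For illustration, the expected selection behaviour might be sketched as below. This is not the actual `K0sLeader()` implementation: the `host` type, its fields, and `pickLeader` are simplified, hypothetical stand-ins, and the addresses are made up.

```go
package main

import "fmt"

// host is a hypothetical, simplified stand-in for a cluster host entry.
type host struct {
	Address string
	Role    string
	Running bool // whether a running k0s service was detected on the host
}

// isController reports whether the host's role includes the controller.
func isController(h host) bool {
	return h.Role == "controller" || h.Role == "controller+worker"
}

// pickLeader returns the first controller that already has k0s running.
// The boolean is false when no such host exists, so callers cannot
// silently fall back to a freshly created, not-yet-joined node.
func pickLeader(hosts []host) (host, bool) {
	for _, h := range hosts {
		if isController(h) && h.Running {
			return h, true
		}
	}
	return host{}, false
}

func main() {
	hosts := []host{
		{Address: "192.0.2.10", Role: "controller+worker", Running: false}, // new node
		{Address: "192.0.2.11", Role: "controller+worker", Running: true},  // existing node
	}
	if leader, ok := pickLeader(hosts); ok {
		fmt.Println("request join token from", leader.Address)
	} else {
		fmt.Println("no running controller found")
	}
}
```

If the running state is detected before the binary upload fails and the service is left stopped, a selection like this would find no candidate at all, which may be related to what the second run hit.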

@kke
Contributor

kke commented Mar 25, 2022

> tee: /usr/local/bin/k0s: Text file busy

The only possible explanation for this is that k0s is still running when trying to replace the binary.

@chattytak

I had the exact same problem.
The first update failed with "tee: /usr/local/bin/k0s: Text file busy", and the k0s binary was removed from the node where the error occurred.
I then tried to update again using k0sctl, but it failed when trying to generate a token and join.
However, that second run put the k0s binary back on the node, so after starting the service again with systemctl on the node, I ran the update with k0sctl once more and the process completed successfully.

@twz123
Member Author

twz123 commented Mar 29, 2022

This is definitely a timing issue. There is a check for whether k0s is still running, but that check may race with a process that is about to terminate but has not quite exited. When rerunning k0sctl apply a few seconds later, the binary can be uploaded, but the run fails later on when trying to invoke k0s install (#362).
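
To make the suspected race concrete: a single "is it running?" probe can pass or fail depending on when it lands, whereas a bounded poll waits for the process to actually disappear. The sketch below is an assumption-laden illustration, not k0sctl code; `waitStopped`, `pollInterval`, and the simulated probe are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// pollInterval is an arbitrary value chosen for this sketch.
const pollInterval = 10 * time.Millisecond

// waitStopped polls isRunning up to maxAttempts times, sleeping
// pollInterval between attempts. It returns true as soon as isRunning
// reports false, and false if the process still appears to be running
// after the last attempt.
func waitStopped(isRunning func() bool, maxAttempts int) bool {
	for i := 0; i < maxAttempts; i++ {
		if !isRunning() {
			return true
		}
		if i < maxAttempts-1 {
			time.Sleep(pollInterval)
		}
	}
	return false
}

func main() {
	// Simulated probe: the process "exits" by the time of the third check.
	probes := 0
	isRunning := func() bool {
		probes++
		return probes < 3
	}
	fmt.Println("stopped:", waitStopped(isRunning, 5), "after probes:", probes)
}
```

Polling only narrows the window; something could still start the binary again between the last probe and the write.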

I see multiple ways of fixing this:

  • Retry the operation a few times on error, either on any error or on the "Text file busy" error specifically. Matching on the message text would make this locale dependent, so we would need to enforce the C locale.
  • Use a temporary file, and move it into its destination afterwards, possibly after deleting the destination file before moving. As far as I understand, this shouldn't trigger the busy error.
  • Use the /proc filesystem, if available, to find all processes running the k0s binary
  • Use a PID file and check for the PID specifically
  • time.Sleep(10 * time.Second) The red army knife for timing issues 👀 Well, there must be a better way than this ...
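
The temporary-file option from the list above could be sketched as follows. This is a local-filesystem illustration only: k0sctl uploads over SSH, so the real change would have to do the equivalent on the remote host. `replaceFile` and `demo` are hypothetical names. The idea relies on rename replacing the directory entry instead of opening the existing file for writing, so a process still executing the old binary keeps its old inode.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replaceFile writes data to a temporary file in the same directory as dst
// and then renames it over dst. The temporary file lives next to dst so
// that the rename stays on one filesystem.
func replaceFile(dst string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(dst), filepath.Base(dst)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Clean up the temporary file on failure. After a successful rename
	// the name no longer exists and the returned error is ignored.
	defer os.Remove(tmpName)

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, dst)
}

// demo replaces a file twice in a scratch directory and returns the
// final contents.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "replace-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	dst := filepath.Join(dir, "k0s")
	if err := replaceFile(dst, []byte("old binary"), 0o755); err != nil {
		return "", err
	}
	if err := replaceFile(dst, []byte("new binary"), 0o755); err != nil {
		return "", err
	}
	got, err := os.ReadFile(dst)
	return string(got), err
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("contents after replace:", got)
}
```

Compared with retrying on the error text, this approach would not depend on the locale of the remote shell.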

@kke
Contributor

kke commented Apr 4, 2022

Hmmm, this is a forced upgrade because of the presence of files: in the config. The "upload binaries" phase should be skipped because k0s is going to be upgraded. There's some error in the host selection logic in that phase.

@twz123
Member Author

twz123 commented Apr 5, 2022

Reopening as this is not yet resolved.

@twz123 twz123 reopened this Apr 5, 2022