Keep the launchd daemon alive #45

Merged: 1 commit, Sep 18, 2024

Member

@nirs nirs commented Sep 16, 2024

Working with multiple Lima clusters in recent weeks, I found that socket_vmnet sometimes stops running for an unknown reason. The typical flow: I try to start the clusters, and the Lima hostagent fails with connection refused on /var/run/socket_vmnet. This happens to me one or more times a day. A stress test that creates and destroys the Lima clusters 50 times fails after several runs, and from the point of the first failure all subsequent runs fail.

The issue seems to be that launchd stops socket_vmnet because it appears idle, and never starts it again. Adding the keep alive option eliminated this issue.

With this change the daemon is kept running, and it should restart after failures.
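
To tell "the daemon is gone" apart from other network problems, one option is to probe the socket directly. The following is a minimal sketch, not code from socket_vmnet or Lima: `probe_unix_socket` is a hypothetical helper, and it assumes the daemon listens on a Unix stream socket at the path mentioned above.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: try to connect to a Unix stream socket.
 * Returns 0 if a listener accepted the connection, otherwise the errno
 * of the failing call (for example ECONNREFUSED or ENOENT). */
int probe_unix_socket(const char *path)
{
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    if (strlen(path) >= sizeof(addr.sun_path))
        return ENAMETOOLONG;
    strcpy(addr.sun_path, path);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    int result = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        result = errno; /* save before close() can overwrite errno */
    close(fd);
    return result;
}
```

A caller could compare the result of `probe_unix_socket("/var/run/socket_vmnet")` against `ECONNREFUSED` (the socket file exists but nothing is listening) and `ENOENT` (the socket file is missing).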

Signed-off-by: Nir Soffer <[email protected]>
Member

@jandubois jandubois left a comment

Thanks, LGTM

@jandubois
Member

> socket_vmnet is not running for unknown reason

launchd will clean up idle jobs when the system comes under load.

@nirs
Member Author

nirs commented Sep 16, 2024

> > socket_vmnet is not running for unknown reason
>
> launchd will clean up idle jobs when the system comes under load.

This makes sense, but I don't see how it can be idle when I'm running 3 clusters with many components communicating between the clusters (e.g. rbd mirroring, submariner tunnels, etc).

Worse, terminating socket_vmnet breaks the Lima VMs' network: after termination the hostagent is in a kind of busy loop, logging errors about using a closed connection.

Member

@AkihiroSuda AkihiroSuda left a comment

Thanks, but I don't think this launchd file is used by Lima?

@jandubois
Member

> Thanks, but I don't think this launchd file is used by Lima?

It is not used when you are using "managed" networks. In that case Lima starts and stops the daemon as needed.

But Lima can also connect to an "unmanaged" network, where you just specify the socket address. I assumed that is what @nirs was doing, and in that case you would use the launchd.plist files to control the daemon.

@AkihiroSuda AkihiroSuda added this to the v1.1.5 milestone Sep 17, 2024
@nirs
Member Author

nirs commented Sep 17, 2024

> > Thanks, but I don't think this launchd file is used by Lima?
>
> It is not used when you are using "managed" networks. In that case Lima is starting/stopping the daemon as-needed.
>
> But Lima can also connect to an "unmanaged" network, where you just specify the socket address. I assumed that is what @nirs was doing, and in that case you would use the launchd.plist files to control the daemon.

Right, this is how I use it. I feel safer when limactl does not have special permissions, and ensuring that running socket_vmnet from brew is safe requires changing ownership of brew directories, which is messy and breaks brew upgrades. Using sudo make install.bin install.launchd is easy enough for developers.

@nirs
Member Author

nirs commented Sep 17, 2024

> > socket_vmnet is not running for unknown reason
>
> launchd will clean up idle jobs when the system comes under load.

I think the issue was incorrect handling of SIGPIPE (#48).

You can see in the log from #43:

Accepted a connection (fd 7)
Closing a connection (fd 5)

Received signal 13
Closing a connection (fd 7)
Closing a connection (fd 6)
Initializing vmnet.framework (mode 1001)
* vmnet_subnet_mask: 255.255.255.0
* vmnet_mtu: 1500
* vmnet_end_address: 192.168.105.254
* vmnet_start_address: 192.168.105.1
* vmnet_interface_id: 36D40E7D-1C36-4B0F-9C63-12D7EDB81215
* vmnet_max_packet_size: 1514
* vmnet_nat66_prefix: fd9b:5a14:ba57:e3d3::
* vmnet_mac_address: 66:c1:18:9f:80:62
Accepted a connection (fd 5)
Accepted a connection (fd 6)
Accepted a connection (fd 7)
Closing a connection (fd 5)
Closing a connection (fd 7)
Closing a connection (fd 6)
Accepted a connection (fd 5)
Accepted a connection (fd 6)
Accepted a connection (fd 7)
Closing a connection (fd 7)
Closing a connection (fd 5)
Closing a connection (fd 6)
Accepted a connection (fd 5)
Accepted a connection (fd 6)
Accepted a connection (fd 7)
Closing a connection (fd 7)
Closing a connection (fd 6)
Closing a connection (fd 5)
Accepted a connection (fd 5)
Accepted a connection (fd 6)
Accepted a connection (fd 7)
Closing a connection (fd 5)
Closing a connection (fd 6)

Received signal 13
Closing a connection (fd 7)

I think the issue is:

  1. When we close a socket (probably when a client disconnects), another thread may be trying to write to it, since the current locking does not ensure that the socket cannot be used while it is being removed.
  2. Writing to a closed socket triggers SIGPIPE.
  3. We handle SIGPIPE as a fatal error and exit.
  4. Without keep alive, launchd does not restart socket_vmnet.

The SIGPIPE issue is fixed in #49, but there may be other reasons for fatal failure.

@AkihiroSuda AkihiroSuda merged commit fe0c96f into lima-vm:master Sep 18, 2024
3 of 5 checks passed
@nirs nirs deleted the keep-alive branch November 20, 2024 17:14