Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test flake: failure to sync NTP #4307

Closed
jgallagher opened this issue Oct 20, 2023 · 3 comments
Closed

Test flake: failure to sync NTP #4307

jgallagher opened this issue Oct 20, 2023 · 3 comments
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@jgallagher
Copy link
Contributor

https://github.com/oxidecomputer/omicron/pull/4305/checks?check_run_id=17908233198 failed on an innocuous change (bumping the SP version, which is not used in CI tests at all yet). The specific error that triggered the failure is a sequence of failures to log in, which eventually times out (~10 minutes):

...
2023-10-20 17:42:26.709172171 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:42:27.711043866 UTC: attempting to log into API
2023-10-20 17:42:42.743305427 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:42:43.744381066 UTC: attempting to log into API
2023-10-20 17:42:58.777480954 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:42:59.778226376 UTC: attempting to log into API
2023-10-20 17:43:14.810998417 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:43:15.811739399 UTC: attempting to log into API
2023-10-20 17:43:30.843854357 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
2023-10-20 17:43:31.844650630 UTC: attempting to log into API
2023-10-20 17:43:46.878150046 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: operation timed out
Error: logging in

Caused by:
    timed out after 609.29703172s

No Nexus logs were created; from sled-agent's logs, it looks like we never got time synchronized. The first time sync log messages start at 17:33:

17:33:38.278Z INFO SledAgent (RSS): Waiting for rack time synchronization
    file = sled-agent/src/rack_setup/service.rs:468
17:33:38.311Z INFO SledAgent (RSS): Checking time synchronization for [fd00:1122:3344:101::1]:12345...
    file = sled-agent/src/rack_setup/service.rs:448
17:33:38.320Z INFO SledAgent (ServiceManager): chronyc command failed: Error running command in zone 'oxz_ntp_1f3dd88e-2680-4aed-ad69-acd5b518c0b2': Command [/usr/bin/chronyc -c tracking] executed and failed with status: exit status: 1  stdout: 506 Cannot talk to daemon
      stderr:
    file = sled-agent/src/services.rs:2473
17:33:38.323Z WARN SledAgent (RSS): Time is not yet synchronized
    error = "Time is synchronized on 0/1 sleds"
    file = sled-agent/src/rack_setup/service.rs:498
17:33:38.443Z INFO SledAgent (RSS): Checking time synchronization for [fd00:1122:3344:101::1]:12345...
    file = sled-agent/src/rack_setup/service.rs:448
17:33:38.453Z INFO SledAgent (RSS): Timesync for [fd00:1122:3344:101::1]:12345 TimeSync { sync: false, ref_id: 2139029761, ip_addr: ::, stratum: 10, ref_time: 1697823217.262026, correction: 0.0 }
    file = sled-agent/src/rack_setup/service.rs:476
17:33:38.453Z WARN SledAgent (RSS): Time is not yet synchronized
    error = "Time is synchronized on 0/1 sleds"
    file = sled-agent/src/rack_setup/service.rs:498

and the last time sync log message is just under 10 minutes later:

17:42:57.904Z INFO SledAgent (RSS): Checking time synchronization for [fd00:1122:3344:101::1]:12345...
    file = sled-agent/src/rack_setup/service.rs:448
17:42:57.918Z INFO SledAgent (RSS): Timesync for [fd00:1122:3344:101::1]:12345 TimeSync { sync: false, ref_id: 2139029761, ip_addr: ::, stratum: 10, ref_time: 1697823776.7655292, correction: 0.0 }
    file = sled-agent/src/rack_setup/service.rs:476
17:42:57.918Z WARN SledAgent (RSS): Time is not yet synchronized
    error = "Time is synchronized on 0/1 sleds"
    file = sled-agent/src/rack_setup/service.rs:498

The chrony tracking.log seems relatively unhelpful?

===================================================================================================================================
   Date (UTC) Time     IP Address   St   Freq ppm   Skew ppm     Offset L Co  Offset sd Rem. corr. Root delay Root disp. Max. error
===================================================================================================================================
2023-10-20 17:33:38 0.0.0.0          0     -0.000 1000000.000  0.000e+00 ?  0  0.000e+00  0.000e+00  1.000e+00  1.000e+00  1.500e+00
@jgallagher jgallagher added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Oct 20, 2023
@bnaecker
Copy link
Collaborator

Hit the same flake a few times during CI runs for #4298, see here for the check run.

@leftwo
Copy link
Contributor

leftwo commented Oct 21, 2023

Hit the same flake a few times during CI runs for #4306 , see here for the check run.

@luqmana
Copy link
Contributor

luqmana commented Oct 21, 2023

From chat: this was due to a mismatch in the sidecar-lite version getting pulled in. #3986 (now in main) includes a change to create_virtual_hardware.sh which should fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

No branches or pull requests

4 participants