test failed on "main": wicket test_inventory #6300
Labels
Test Flake
Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Comments
davepacheco added the Test Flake label on Aug 12, 2024
this test passed on my local machine on commit
@faithanalog do we have enough information from the data that gets saved on failure to debug it?
Naively, my guess would be a synchronization issue; we may need to insert a sync barrier or just add retries somewhere.
This failed on the rel/v10 branch, it seems, as well:
#6456 is a proposed fix.
sunshowers added a commit that referenced this issue on Aug 27, 2024
I haven't been able to reproduce this locally, but this is my best guess as to what's going wrong here: MGS/wicketd learns about SPs but due to a race/load on the system, misses out on populating their state and instead leaves it empty. That causes the SPs to be filtered out here: https://github.com/oxidecomputer/omicron/blob/7a6f45c5504bb092ce738d165cc88736ba4a9092/wicketd/src/rss_config.rs#L129

This theory is buttressed by the fact that in failing logs, the returned inventory is a lot smaller than what I'm seeing locally. For example, in the logs for [this failing test](https://buildomat.eng.oxide.computer/wg/0/details/01J69AR918WAQNFKSBS85EAQPV/kkFMDYhAM3Vxb5ujRHlyAO9thmIAc7mHjHuicct0gS2bL8xu/01J69ARHYXXSKXKG8J49SRZVTA) I see [a 1430 byte response](https://buildomat.eng.oxide.computer/wg/0/artefact/01J69AR918WAQNFKSBS85EAQPV/kkFMDYhAM3Vxb5ujRHlyAO9thmIAc7mHjHuicct0gS2bL8xu/01J69ARHYXXSKXKG8J49SRZVTA/01J69ENP3EF3A212GVAGEMBDVQ/mod-ff551cc639cd8d16-test_inventory.21679.0.log?format=x-bunyan#L640):

```
test_inventory (wicketd test client): client response result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(::1)), port: Some(45364), path: "/inventory", query: None, fragment: None }, status: 200, headers: {"content-type": "application/json", "x-request-id": "e68141e2-4c4f-46ec-a49b-9f8aa11a3410", "content-length": "1430", "date": "Tue, 27 Aug 2024 08:13:01 GMT"} })
```

But in passing runs locally, I see a much larger 8654 byte response ([full logs](https://gist.github.com/sunshowers/b9c1868ba4c8c4bd3eec49cc4b56516d)):

```
19:32:43.847Z DEBG test_inventory (wicketd test client): client response result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(::1)), port: Some(44183), path: "/inventory", query: None, fragment: None }, status: 200, headers: {"content-type": "application/json", "x-request-id": "8b48dae0-025d-426a-82f0-1dd8323670d5", "content-length": "8654", "date": "Tue, 27 Aug 2024 19:32:43 GMT"} })
```

Based on this theory, this PR changes the exit condition for the poll loop to also consider all of the SP states being present. In case there's something else going on, the PR also adds a bunch of additional logging.

Fixes #6300.
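For readers following along, the exit-condition change described above might look roughly like the sketch below. This is not the code from the PR: `SpEntry`, `inventory_ready`, and `poll_inventory` are hypothetical names, the `state` field is an assumed stand-in for whatever wicketd actually stores per SP, and the fetch step is abstracted as a closure rather than a real HTTP call to `/inventory`.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one SP entry in the returned inventory.
#[derive(Debug, Clone)]
struct SpEntry {
    id: u32,
    /// `None` models "SP is known, but its state has not been populated yet".
    state: Option<String>,
}

/// Stricter exit condition: the inventory must be non-empty AND every SP
/// must already have its state present.
fn inventory_ready(sps: &[SpEntry]) -> bool {
    !sps.is_empty() && sps.iter().all(|sp| sp.state.is_some())
}

/// Calls `fetch` repeatedly until `inventory_ready` holds or `timeout` elapses.
/// On timeout, the error lists the SP ids that were still missing state.
fn poll_inventory<F>(
    mut fetch: F,
    interval: Duration,
    timeout: Duration,
) -> Result<Vec<SpEntry>, String>
where
    F: FnMut() -> Vec<SpEntry>,
{
    let deadline = Instant::now() + timeout;
    loop {
        let sps = fetch();
        if inventory_ready(&sps) {
            return Ok(sps);
        }
        if Instant::now() >= deadline {
            let missing: Vec<u32> = sps
                .iter()
                .filter(|sp| sp.state.is_none())
                .map(|sp| sp.id)
                .collect();
            return Err(format!(
                "timed out with {} SPs known; missing state for ids {:?}",
                sps.len(),
                missing
            ));
        }
        sleep(interval);
    }
}
```

With a condition like this, a response that lists SPs without their state no longer ends the loop early; the test keeps polling until the state shows up or the deadline passes, and the timeout error names the SPs that never reported state.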
Here's a test failure from "main": https://github.com/oxidecomputer/omicron/runs/28677241691
Log: https://buildomat.eng.oxide.computer/wg/0/details/01J549359YCSA4Y921PHPZGQVM/kTzmJcz1Mc8NSLB99WIu2CWqB1KFBnc7diulpdI2Zn5NXKBa/01J5493FW2PTJ2JW51X2Q0RN9J
Excerpt:
Given that this passed on the parent commit as well as on this PR itself, I suspect this is a flake, but I haven't looked into it at all.