Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test failed on "main": wicket test_inventory #6300

Closed
davepacheco opened this issue Aug 12, 2024 · 5 comments · Fixed by #6456
Closed

test failed on "main": wicket test_inventory #6300

davepacheco opened this issue Aug 12, 2024 · 5 comments · Fixed by #6456
Assignees
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@davepacheco
Copy link
Collaborator

Here's a test failure from "main": https://github.com/oxidecomputer/omicron/runs/28677241691
Log: https://buildomat.eng.oxide.computer/wg/0/details/01J549359YCSA4Y921PHPZGQVM/kTzmJcz1Mc8NSLB99WIu2CWqB1KFBnc7diulpdI2Zn5NXKBa/01J5493FW2PTJ2JW51X2Q0RN9J

Excerpt:

--- STDERR:              wicketd::mod integration_tests::inventory::test_inventory ---
log file: /var/tmp/omicron_tmp/mod-0a881862c5602f71-test_inventory.20657.0.log
note: configured to log to "/var/tmp/omicron_tmp/mod-0a881862c5602f71-test_inventory.20657.0.log"
thread 'integration_tests::inventory::test_inventory' panicked at wicketd/tests/integration_tests/inventory.rs:84:9:
assertion `left == right` failed
  left: []
 right: [BootstrapSledDescription { id: SpIdentifier { slot: 0, type_: Sled }, baseboard: Gimlet { identifier: "SimGimlet00", model: "i86pc", revision: 0 }, bootstrap_ip: None }, BootstrapSledDescription { id: SpIdentifier { slot: 1, type_: Sled }, baseboard: Gimlet { identifier: "SimGimlet01", model: "i86pc", revision: 0 }, bootstrap_ip: None }]

Given that this passed on the parent commit as well as on this PR itself, I suspect this is a flake, but I haven't looked into it at all.

@davepacheco davepacheco added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Aug 12, 2024
@faithanalog
Copy link
Contributor

this test passed on my local machine on commit 914f5fd7d51f9b060dcc0382a30b607e25df49b2 so i also think it is a flake

@davepacheco
Copy link
Collaborator Author

@faithanalog do we have enough information from the data that gets saved on failure to debug it?

@sunshowers
Copy link
Contributor

sunshowers commented Aug 15, 2024

Another failure: https://buildomat.eng.oxide.computer/wg/0/details/01J5BVA4K80CA7M27G0KKNK0DB/t4N5e5yPiF1oiJjAD8kNDVy99ybJF9rsu2TkBc2XaQL4YSx5/01J5BVAQZJFZKMYF2SD0FWA991#S5563

Naively, my guess would be a synchronization issue, may need to insert a sync barrier or just retries somewhere.

@jclulow
Copy link
Collaborator

jclulow commented Aug 27, 2024

@sunshowers
Copy link
Contributor

#6456 is a proposed fix.

sunshowers added a commit that referenced this issue Aug 27, 2024
I haven't been able to reproduce this locally, but this is my best guess
as to what's going wrong here: MGS/wicketd learns about SPs but due to a
race/load on the system, misses out on populating their state and
instead leaves it empty. That causes the SPs to be filtered out here:

https://github.com/oxidecomputer/omicron/blob/7a6f45c5504bb092ce738d165cc88736ba4a9092/wicketd/src/rss_config.rs#L129

This theory is buttressed by the fact that in failing logs, the returned
inventory is a lot smaller than what I'm seeing locally.

For example, in the logs for [this failing
test](https://buildomat.eng.oxide.computer/wg/0/details/01J69AR918WAQNFKSBS85EAQPV/kkFMDYhAM3Vxb5ujRHlyAO9thmIAc7mHjHuicct0gS2bL8xu/01J69ARHYXXSKXKG8J49SRZVTA)
I see [a 1430 byte
response](https://buildomat.eng.oxide.computer/wg/0/artefact/01J69AR918WAQNFKSBS85EAQPV/kkFMDYhAM3Vxb5ujRHlyAO9thmIAc7mHjHuicct0gS2bL8xu/01J69ARHYXXSKXKG8J49SRZVTA/01J69ENP3EF3A212GVAGEMBDVQ/mod-ff551cc639cd8d16-test_inventory.21679.0.log?format=x-bunyan#L640):

```
test_inventory (wicketd test client): client response
    result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(::1)), port: Some(45364), path: "/inventory", query: None, fragment: None }, status: 200, headers: {"content-type": "application/json", "x-request-id": "e68141e2-4c4f-46ec-a49b-9f8aa11a3410", "content-length": "1430", "date": "Tue, 27 Aug 2024 08:13:01 GMT"} })
```

But in passing runs locally, I see a much larger 8654 byte response
([full
logs](https://gist.github.com/sunshowers/b9c1868ba4c8c4bd3eec49cc4b56516d)):

```
19:32:43.847Z DEBG test_inventory (wicketd test client): client response
    result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(::1)), port: Some(44183), path: "/inventory", query: None, fragment: None }, status: 200, headers: {"content-type": "application/json", "x-request-id": "8b48dae0-025d-426a-82f0-1dd8323670d5", "content-length": "8654", "date": "Tue, 27 Aug 2024 19:32:43 GMT"} })
```

Based on this theory, this PR changes the exit condition for the poll
loop to also consider all of the SP states being present.

In case there's something else going on, the PR also adds a bunch of
additional logging.

Fixes #6300.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants