Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test failure: bulk_sp_get_one_sp_powered_off #951

Closed
davepacheco opened this issue Apr 19, 2022 · 2 comments
Closed

test failure: bulk_sp_get_one_sp_powered_off #951

davepacheco opened this issue Apr 19, 2022 · 2 comments

Comments

@davepacheco
Copy link
Collaborator

Found in https://github.com/oxidecomputer/omicron/runs/6062888362?check_suite_focus=true#step:12:778 under #942:

     Running `/Users/runner/work/omicron/omicron/target/debug/deps/test_all-f0bf36f3c61b9c15`

running 6 tests
test integration_tests::bulk_state_get::bulk_sp_get_one_sp_powered_off ... FAILED
test integration_tests::bulk_state_get::bulk_sp_get_all_online ... ok
test integration_tests::commands::test_gateway_openapi_sled ... ok
test integration_tests::bulk_state_get::bulk_sp_get_one_sp_unresponsive ... ok
test integration_tests::serial_console::serial_console_detach ... ok
test integration_tests::serial_console::serial_console_communication ... ok

failures:

---- integration_tests::bulk_state_get::bulk_sp_get_one_sp_powered_off stdout ----
log file: "/tmp/omicron_tmp/test_all-f0bf36f3c61b9c15-bulk_sp_get_all_online.5369.0.log"
note: configured to log to "/tmp/omicron_tmp/test_all-f0bf36f3c61b9c15-bulk_sp_get_all_online.5369.0.log"
thread 'integration_tests::bulk_state_get::bulk_sp_get_one_sp_powered_off' panicked at 'assertion failed: `(left == right)`
  left: `204`,
 right: `500`', /Users/runner/.cargo/git/checkouts/dropshot-a4a923d29dccc492/da09c39/dropshot/src/test_util.rs:220:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    integration_tests::bulk_state_get::bulk_sp_get_one_sp_powered_off

test result: FAILED. 5 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 3.00s

The outputs were successfully uploaded:
https://github.com/oxidecomputer/omicron/runs/6062888362?check_suite_focus=true#step:13:28

Here's the end of the output file:

dap@zathras tmp $ bunyan test_all-f0bf36f3c61b9c15-bulk_sp_get_all_online.5369.0.log [2022-04-18T13:39:07.482916Z]  INFO: bulk_sp_get_all_online/5369 on Mac-1650287278361.local: setting up simualted sidecar (slot="sidecar 0")
[2022-04-18T13:39:07.483807Z]  INFO: bulk_sp_get_all_online/5369 on Mac-1650287278361.local: setting up simualted gimlet (slot="gimlet 0")
[2022-04-18T13:39:07.486855Z]  INFO: bulk_sp_get_all_online/5369 on Mac-1650287278361.local: setting up simualted gimlet (slot="gimlet 1")
[2022-04-18T13:39:07.489108Z]  INFO: 2d6d80f8-5967-4973-9df4-c457e91618fa/5369 on Mac-1650287278361.local: setting up gateway server
[2022-04-18T13:39:07.490474Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/5369 on Mac-1650287278361.local: successfully registered DTrace USDT probes
[2022-04-18T13:39:07.491667Z]  INFO: 2d6d80f8-5967-4973-9df4-c457e91618fa/SpCommunicator/5369 on Mac-1650287278361.local: started SP communicator
[2022-04-18T13:39:07.493156Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/ignition, method=GET)
[2022-04-18T13:39:07.495031Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/ignition/{type}/{slot}, method=GET)
[2022-04-18T13:39:07.496276Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp, method=GET)
[2022-04-18T13:39:07.497352Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp/{type}/{slot}, method=GET)
[2022-04-18T13:39:07.498426Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp/{type}/{slot}/component, method=GET)
[2022-04-18T13:39:07.499555Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp/{type}/{slot}/component/{component}, method=GET)
[2022-04-18T13:39:07.501284Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp/{type}/{slot}/component/{component}/power_off, method=POST)
[2022-04-18T13:39:07.503046Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp/{type}/{slot}/component/{component}/power_on, method=POST)
[2022-04-18T13:39:07.504299Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, method=GET)
    path: /sp/{type}/{slot}/component/{component}/serial_console/attach
[2022-04-18T13:39:07.505363Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, method=POST)
    path: /sp/{type}/{slot}/component/{component}/serial_console/detach
[2022-04-18T13:39:07.506144Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp/{type}/{slot}/component/{component}/update, method=POST)
[2022-04-18T13:39:07.506876Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp/{type}/{slot}/power_off, method=POST)
[2022-04-18T13:39:07.507623Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: registered endpoint (local_addr=127.0.0.1:49412, path=/sp/{type}/{slot}/power_on, method=POST)
[2022-04-18T13:39:07.508353Z]  INFO: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: listening (local_addr=127.0.0.1:49412)
[2022-04-18T13:39:07.508965Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: successfully registered DTrace USDT probes (local_addr=127.0.0.1:49412)
[2022-04-18T13:39:07.509512Z]  INFO: bulk_sp_get_all_online/client test context/5369 on Mac-1650287278361.local: client request (body=Body(Empty), method=POST)
    uri: http://127.0.0.1:49412http://127.0.0.1:49412/sp/sled/0/power_off
[2022-04-18T13:39:07.510175Z]  INFO: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: accepted connection (local_addr=127.0.0.1:49412, remote_addr=127.0.0.1:49413)
[2022-04-18T13:39:07.510777Z] TRACE: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: incoming request (req_id=4e11f62a-f387-4037-a572-f4fe0973268e, uri=http://127.0.0.1:49412/sp/sled/0/power_off, method=POST, remote_addr=127.0.0.1:49413, local_addr=127.0.0.1:49412)
[2022-04-18T13:39:07.996817Z] DEBUG: 2d6d80f8-5967-4973-9df4-c457e91618fa/SpCommunicator/5369 on Mac-1650287278361.local: sending Request { version: 1, request_id: 0, kind: IgnitionCommand { target: 1, command: PowerOff } } to SP SpSocket { socket: PollEvented { io: Some(UdpSocket { addr: 127.0.0.1:56453, fd: 36 }) }, addr: 127.0.0.1:65481, port: SwitchPort(0) }
[2022-04-18T13:39:07.997286Z]  INFO: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: request completed (req_id=4e11f62a-f387-4037-a572-f4fe0973268e, uri=http://127.0.0.1:49412/sp/sled/0/power_off, method=POST, remote_addr=127.0.0.1:49413, local_addr=127.0.0.1:49412, error_message_external="Internal Server Error", error_message_internal="timeout elapsed", response_code=500)
[2022-04-18T13:39:07.998064Z] DEBUG: bulk_sp_get_all_online/5369 on Mac-1650287278361.local: received ignition command PowerOff for target 1; sending ack (slot="sidecar 0")
[2022-04-18T13:39:07.998467Z]  INFO: bulk_sp_get_all_online/client test context/5369 on Mac-1650287278361.local: client received response (status=500)

It looks like a timeout. If we grep for only the messages for this request-id:

[2022-04-18T13:39:07.510777Z] TRACE: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: incoming request (req_id=4e11f62a-f387-4037-a572-f4fe0973268e, uri=http://127.0.0.1:49412/sp/sled/0/power_off, method=POST, remote_addr=127.0.0.1:49413, local_addr=127.0.0.1:49412)
[2022-04-18T13:39:07.997286Z]  INFO: 2d6d80f8-5967-4973-9df4-c457e91618fa/dropshot/5369 on Mac-1650287278361.local: request completed (req_id=4e11f62a-f387-4037-a572-f4fe0973268e, uri=http://127.0.0.1:49412/sp/sled/0/power_off, method=POST, remote_addr=127.0.0.1:49413, local_addr=127.0.0.1:49412, error_message_external="Internal Server Error", error_message_internal="timeout elapsed", response_code=500)

It looks like the timeout was at most 487ms. It may be that needs to be bumped up, at least for the test suite?

It'd also be helpful if the internal message said what the timeout was for and how long it waited.

@jgallagher does it make sense for you to take a look at this one?

@jgallagher
Copy link
Contributor

Yep, looking at this now.

@jgallagher
Copy link
Contributor

This should be resolved by #953 (both by increasing the test timeouts significantly, and if it reoccurs the logging will be much more informative).

leftwo pushed a commit that referenced this issue Oct 4, 2023
Crucible updates
    all Crucible connections should set TCP_NODELAY (#983)
    Use a fixed size for tag and nonce (#957)
    Log crucible opts on start, order crutest options (#974)
    Lock the Downstairs less (#966)
    Cache dirty flag locally, reducing SQLite operations (#970)
    Make stats mutex synchronous (#961)
    Optimize requeue during flow control conditions (#962)
    Update Rust crate base64 to 0.21.4 (#950)
    Do less in control (#949)
    Fix --flush-per-blocks (#959)
    Fast dependency checking (#916)
    Update actions/checkout action to v4 (#960)
    Use `cargo hakari` for better workspace deps (#956)
    Update actions/checkout digest to 8ade135 (#939)
    Cache block size in Guest (#947)
    Update Rust crate ringbuffer to 0.15.0 (#954)
    Update Rust crate toml to 0.8 (#955)
    Update Rust crate reedline to 0.24.0 (#953)
    Update Rust crate libc to 0.2.148 (#952)
    Update Rust crate indicatif to 0.17.7 (#951)
    Remove unused async (#943)
    Use a synchronous mutex for bw/iop_tokens (#946)
    Make flush ID non-locking (#945)
    Use `oneshot` channels instead of `mpsc` for notification (#918)
    Use a strong type for upstairs negotiation (#941)
    Add a "dynamometer" option to crucible-downstairs (#931)
    Get new work and active count in one lock (#938)
    A bunch of misc test cleanup stuff (#937)
    Wait for a snapshot to finish on all downstairs (#920)
    dsc and clippy cleanup. (#935)
    No need to sort ackable_work (#934)
    Use a strong type for repair ID (#928)
    Keep new jobs sorted (#929)
    Remove state_count function on Downstairs (#927)
    Small cleanup to IOStateCount (#932)
    let cmon and IOStateCount use ClientId (#930)
    Fast return for zero length IOs (#926)
    Use a strong type for client ID (#925)
    A few Crucible Agent fixes (#922)
    Use a newtype for `JobId` (#919)
    Don't pass MutexGuard into functions (#917)
    Crutest updates, rename tests, new options (#911)

Propolis updates
    Update tungstenite crates to 0.20
    Use `strum` crate for enum-related utilities
    Wire up bits for CPUID customization
    PHD: improve artifact store (#529)
    Revert abort-on-panic in 'dev' cargo profile
leftwo added a commit that referenced this issue Oct 5, 2023
Crucible updates
    all Crucible connections should set TCP_NODELAY (#983)
    Use a fixed size for tag and nonce (#957)
    Log crucible opts on start, order crutest options (#974)
    Lock the Downstairs less (#966)
    Cache dirty flag locally, reducing SQLite operations (#970)
    Make stats mutex synchronous (#961)
    Optimize requeue during flow control conditions (#962)
    Update Rust crate base64 to 0.21.4 (#950)
    Do less in control (#949)
    Fix --flush-per-blocks (#959)
    Fast dependency checking (#916)
    Update actions/checkout action to v4 (#960)
    Use `cargo hakari` for better workspace deps (#956)
    Update actions/checkout digest to 8ade135 (#939)
    Cache block size in Guest (#947)
    Update Rust crate ringbuffer to 0.15.0 (#954)
    Update Rust crate toml to 0.8 (#955)
    Update Rust crate reedline to 0.24.0 (#953)
    Update Rust crate libc to 0.2.148 (#952)
    Update Rust crate indicatif to 0.17.7 (#951)
    Remove unused async (#943)
    Use a synchronous mutex for bw/iop_tokens (#946)
    Make flush ID non-locking (#945)
    Use `oneshot` channels instead of `mpsc` for notification (#918)
    Use a strong type for upstairs negotiation (#941)
    Add a "dynamometer" option to crucible-downstairs (#931)
    Get new work and active count in one lock (#938)
    A bunch of misc test cleanup stuff (#937)
    Wait for a snapshot to finish on all downstairs (#920)
    dsc and clippy cleanup. (#935)
    No need to sort ackable_work (#934)
    Use a strong type for repair ID (#928)
    Keep new jobs sorted (#929)
    Remove state_count function on Downstairs (#927)
    Small cleanup to IOStateCount (#932)
    let cmon and IOStateCount use ClientId (#930)
    Fast return for zero length IOs (#926)
    Use a strong type for client ID (#925)
    A few Crucible Agent fixes (#922)
    Use a newtype for `JobId` (#919)
    Don't pass MutexGuard into functions (#917)
    Crutest updates, rename tests, new options (#911)

Propolis updates
    Update tungstenite crates to 0.20
    Use `strum` crate for enum-related utilities
    Wire up bits for CPUID customization
    PHD: improve artifact store (#529)
    Revert abort-on-panic in 'dev' cargo profile

---------

Co-authored-by: Alan Hanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants