Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test failed in CI: helios/deploy CLI SendError #6771

Open
jgallagher opened this issue Oct 4, 2024 · 25 comments
Open

test failed in CI: helios/deploy CLI SendError #6771

jgallagher opened this issue Oct 4, 2024 · 25 comments
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@jgallagher
Copy link
Contributor

This test failed on a CI run on "main":

https://github.com/oxidecomputer/omicron/runs/31088964323

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ

Excerpt from the log showing the failure:

976	2024-10-04T15:58:09.854Z	+ /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.2.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
977	2024-10-04T15:59:47.203Z	sending chunk to thread failed with SendError { .. }
978	2024-10-04T15:59:48.324Z	channel closed
@jgallagher jgallagher added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label Oct 4, 2024
@jgallagher jgallagher changed the title test failed in CI: NAME_OF_TEST test failed in CI: helios/deploy CLI SendError Oct 4, 2024
@davepacheco
Copy link
Collaborator

Figuring that kind of failure might result from a Nexus crash, I took a look at the Nexus logs. I don't see any crashes but I do see some related-looking request failures. All the import-related request log entries seem to be in this log:
https://buildomat.eng.oxide.computer/wg/0/artefact/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ/01J9C46T7F5WYNKPWE02126CE3/oxide-nexus:default.log?format=x-bunyan

and I see a bunch of messages like this one:
https://buildomat.eng.oxide.computer/wg/0/artefact/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ/01J9C46T7F5WYNKPWE02126CE3/oxide-nexus:default.log?format=x-bunyan#L35948

2024-10-04T15:59:47.315Z	INFO	nexus (dropshot_external): request completed
    error_message_external = cannot import blocks with a bulk write for disk in state ImportReady
    error_message_internal = cannot import blocks with a bulk write for disk in state ImportReady
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:938
    latency_us = 451297
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 10.151.2.100:52890
    req_id = e74f85b8-5e08-495e-a56f-2239902eeaab
    response_code = 400
    uri = /v1/disks/debian11-boot/bulk-write?project=images

and a couple where the state is Finalizing or Detached.

@jmpesp
Copy link
Contributor

jmpesp commented Oct 4, 2024

I'm unable to reproduce this now, of course, though I noticed that the oxide-v0.1.0 binary that is grabbed during the deploy job is quite old:

james@atrium ~ $ curl -sSfL -I http://catacomb.eng.oxide.computer:12346/oxide-v0.1.0
...
Last-Modified: Tue, 05 Dec 2023 21:41:37 GMT
...

@jmpesp
Copy link
Contributor

jmpesp commented Oct 4, 2024

I was finally able to reproduce this using the old CLI binary:

[00:00:16] [█████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 954.50 MiB/2.00 GiB (75.49 MiB/s, 14s)
sending chunk to thread failed with SendError { .. }
channel closed

The trick was to set a --timeout to something very small:

./oxide-v0.1.0 --timeout 2 ...

causing the .disk_bulk_write_import call to timeout, causing the upload thread to drop the receiver. This makes me believe it's related to IO speed being very slow for this setup, as the normal timeout without the explicit argument is probably the progenitor default of 15 seconds!

In contrast, the latest CLI does responds to a timeout differently:

[00:00:17] [███████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 907.50 MiB/2.00 GiB (52.85 MiB/s, 22s
Communication Error: error sending request for url (http://fancyfeast:8080/v1/disks/crucible-tester-duster-disk?project=myproj): operation timed out: sending chunk to thread failed: channel closed

@jmpesp
Copy link
Contributor

jmpesp commented Oct 4, 2024

This is almost certainly the case: Nexus reports that a client disconnected during a request to the bulk write endpoint:

nexus (dropshot_external): request handling cancelled (client disconnected)
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:887
    latency_us = 1546327
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 10.151.2.100:36568
    req_id = 6f78ea08-7945-43f1-8574-54f7b2b86ab8
    uri = /v1/disks/debian11-boot/bulk-write?project=images

@iliana
Copy link
Contributor

iliana commented Oct 8, 2024

Do we think updating the CLI binary is useful here, or will it similarly timeout?
And/or should we increase the timeout to handle the CI setup not being Gimlet-grade SSDs?

@jmpesp
Copy link
Contributor

jmpesp commented Oct 8, 2024

Do we think updating the CLI binary is useful here, or will it similarly timeout?

We probably should update it so we're testing the latest binary in general, but I suspect it will similarly time out.

should we increase the timeout to handle the CI setup not being Gimlet-grade SSDs?

I'm not sure: .github/buildomat/jobs/deploy.sh says that the target is lab-2.0-opte-0.33, and doesn't that mean that this is running on our hardware? Those shouldn't have disks so bad that it takes 15 seconds to write 512kb (!).

If I'm misreading this and it's running on AWS, are these t2.nano instances or something haha? Again, 15 seconds to write 512kb is huge, unless it's something like we don't have TRIM enabled and there's a big reclaim going on?

@iliana
Copy link
Contributor

iliana commented Oct 8, 2024

It is running on lab hardware, I'm just not sure what the SSDs in the machine are. (Also keep in mind that the ZFS pools used by the deployed control plane are file-backed.)

@leftwo
Copy link
Contributor

leftwo commented Oct 9, 2024

I'm using stock omicron bits on my bench gimlet and I can get this to happen 100% of the time:

EVT22200005 # oxide disk import --project alan --description "test2" --path /alan/cli/jammy-server-cloudimg-amd64.raw --disk test2
[00:00:55] [████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 377.50 MiB/2.20 GiB (3.53 MiB/s, 9m)sending chunk to thread failed with SendError { .. }
[00:00:55] [████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 378.50 MiB/2.20 GiB (3.53 MiB/s, 9m)
channel closed

I'm on omicron commit: 0640bb2
My bench gimlet is:

EVT22200005 # uname -a
SunOS EVT22200005 5.11 helios-2.0.22861 oxide i386 oxide

And, the cli:

EVT22200005 # oxide version
Oxide CLI 0.6.1+20240710.0
Built from commit: 4a896d13efa760c0e9120e5d82fb2b6b7f5a27da 
Oxide API: 20240821.0

I added a --timeout 60 to my import line, and I did get an upload to work.

@andrewjstone
Copy link
Contributor

@leftwo
Copy link
Contributor

leftwo commented Oct 9, 2024

I updated my bench gimlet to:

EVT22200005 # uname -a
SunOS EVT22200005 5.11 helios-2.0.22957 oxide i386 oxide

And, it still fails the same way.

@andrewjstone
Copy link
Contributor

I'm unclear if this is the same error, but it seems like it might be related:
https://github.com/oxidecomputer/omicron/pull/6810/checks?check_run_id=31332921093

702 | 2024-10-10T05:40:53.995Z | + /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.2.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
-- | -- | --
1703 | 2024-10-10T05:44:03.700Z | one of the upload threads failed
1704 | 2024-10-10T05:44:04.618Z | one of the upload threads failed

@jgallagher
Copy link
Contributor Author

I saw the same variant Andrew did on https://buildomat.eng.oxide.computer/wg/0/details/01J9Y2BWQWFB0HPRRFCWC6GHAN/HzxHMKpHk2lzsmDBBvStwfjrwTZfnljEjTH4TLjd2l3KTopv/01J9Y2CEDVND9FA5DK9G0AX5FY. That run is on #6822 which does modify the pantry code path, but I think the modifications it makes (changing how Nexus chooses a pantry) would result in a different kind of failure if they were wrong.

871     2024-10-11T16:02:24.918Z        + /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.1.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
872     2024-10-11T16:05:33.357Z        one of the upload threads failed
873     2024-10-11T16:05:34.447Z        one of the upload threads failed

@jgallagher
Copy link
Contributor Author

I hit this a few times in a4x2 yesterday. I'll try Alan's workaround of a higher timeout next time I see it and report back.

@hawkw
Copy link
Member

hawkw commented Oct 17, 2024

Observed this again on #6879 after it merged to main as commit e577765: https://buildomat.eng.oxide.computer/wg/0/details/01JADWA2KEP8QZFHFYKN64S72N/EMg9DuE9YnNp0draUTy1mUsUfXrDkrVzCz53bQfjKaG7Chwz/01JADWASA6T2RC0N248ZNJC5NT#S1312

Seems like the same variant John observed in #6771 (comment), but I figured I'd drop it here anyway in case it's useful.

@bnaecker
Copy link
Collaborator

I've hit this several times in a row trying to get #6889 merged. Not sure what the best path forward in this case is, but maybe we can bump the timeout as @leftwo suggested, in the helios/deploy job specifically?

@bnaecker
Copy link
Collaborator

In the interest of getting PRs merged more readily, I've increased the timeout to 60s in d69155e. This obviously sucks, and isn't a substitute for a deeper understanding of and better solution to the issue.

@leftwo
Copy link
Contributor

leftwo commented Oct 21, 2024

On the bench gimlet setup, I captured the pantry log while reproducing the error. The full log is here: /staff/cores/omicron-6771

Of interest, we have the initial creation and setup request:

15:24:52.335Z INFO crucible-pantry (dropshot): accepted connection                                                                        
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    remote_addr = [fd00:1122:3344:101::a]:38249                                                                                           
15:24:52.335Z INFO crucible-pantry (datafile): no entry exists for volume edda9e6e-c95d-4488-a73d-ec5bf91a121d, constructing...           
15:24:52.335Z INFO crucible-pantry (datafile): Upstairs starts                                                                            
15:24:52.335Z INFO crucible-pantry (datafile): Crucible Version: BuildInfo {                                                              
        version: "0.0.1",                                                                                                                 
        git_sha: "2b88ab88461fb06aaf2aab11c5e381a3cad25eac",                                                                              
        git_commit_timestamp: "2024-09-27T17:26:25.000000000Z",                                                                           
        git_branch: "main",                                                                                                               
        rustc_semver: "1.80.0",                                                                                                           
        rustc_channel: "stable",                                                                                                          
        rustc_host_triple: "x86_64-unknown-illumos",                                                                                      
        rustc_commit_sha: "051478957371ee0084a7c0913941d2a8c4757bb9",                                                                     
        cargo_triple: "x86_64-unknown-illumos",                                                                                           
        debug: false,                                                                                                                     
        opt_level: 3,                                                                                                                     
    }                                                                                                                                     
15:24:52.335Z INFO crucible-pantry (datafile): Upstairs <-> Downstairs Message Version: 11                                                
15:24:52.335Z INFO crucible-pantry (datafile): Using region definition RegionDefinition { block_size: 512, extent_size: Block { value: 131
072, shift: 9 }, extent_count: 48, uuid: 00000000-0000-0000-0000-000000000000, encrypted: true, database_read_version: 1, database_write_v
ersion: 1 }                                                                                                                               
15:24:52.336Z INFO crucible-pantry (datafile): Crucible edda9e6e-c95d-4488-a73d-ec5bf91a121d has session id: daff5380-9074-4e67-9d63-2597c
356c51f                                                                                                                                   
    session_id = daff5380-9074-4e67-9d63-2597c356c51f                                                                                     
15:24:52.336Z INFO crucible-pantry (datafile): Upstairs opts: Upstairs UUID: edda9e6e-c95d-4488-a73d-ec5bf91a121d, Targets: [[fd00:1122:33
44:101::14]:19000, [fd00:1122:3344:101::12]:19000, [fd00:1122:3344:101::18]:19000], lossy: false, flush_timeout: None, key populated: true
,  cert_pem populated: false,  key_pem populated: false,  root_cert_pem populated: false,  Control: None,  read_only: false               
    session_id = daff5380-9074-4e67-9d63-2597c356c51f                                                                                     
15:24:52.336Z INFO crucible-pantry (datafile): Crucible stats registered with UUID: edda9e6e-c95d-4488-a73d-ec5bf91a121d                  
    session_id = daff5380-9074-4e67-9d63-2597c356c51f                                                                                     
15:24:52.380Z INFO crucible-pantry (datafile): volume edda9e6e-c95d-4488-a73d-ec5bf91a121d constructed ok                                 
15:24:52.380Z INFO crucible-pantry (datafile): The guest has requested activation                                                         
15:24:52.380Z INFO crucible-pantry (datafile): edda9e6e-c95d-4488-a73d-ec5bf91a121d active request set                                    
    session_id = daff5380-9074-4e67-9d63-2597c356c51f

Things start up and downstairs connect:

15:24:52.382Z INFO crucible-pantry (datafile): downstairs client at Some([fd00:1122:3344:101::18]:19000) has region UUID 03f3231b-aacc-40c
1-a5fa-20dab91ee3f1                                                                                                                       
     = downstairs                                                                                                                         
    client = 2                                                                                                                            
    session_id = daff5380-9074-4e67-9d63-2597c356c51f                                                                                     
15:24:52.382Z INFO crucible-pantry (datafile): downstairs client at Some([fd00:1122:3344:101::12]:19000) has region UUID e5c795fe-4a2e-458
5-b9b0-c2d1221bbdfa                                                                                                                       
     = downstairs                                                                                                                         
    client = 1                                                                                                                            
    session_id = daff5380-9074-4e67-9d63-2597c356c51f                                                                                     
15:24:52.382Z INFO crucible-pantry (datafile): downstairs client at Some([fd00:1122:3344:101::14]:19000) has region UUID f475816b-a491-4f4
d-befc-1ab264dcc997                                                                                                                       
     = downstairs                                                                                                                         
    client = 0                                                                                                                            
    session_id = daff5380-9074-4e67-9d63-2597c356c51f  

We activate, and bulk uploads start:

15:24:52.383Z INFO crucible-pantry (datafile): Set Active after no reconciliation                                                         
    session_id = daff5380-9074-4e67-9d63-2597c356c51f                                                                                     
15:24:52.383Z INFO crucible-pantry (datafile): The guest has finished waiting for activation                                              
15:24:52.383Z INFO crucible-pantry (datafile): volume edda9e6e-c95d-4488-a73d-ec5bf91a121d activated ok                                   
15:24:52.383Z INFO crucible-pantry (datafile): volume edda9e6e-c95d-4488-a73d-ec5bf91a121d constructed and inserted ok                    
15:24:52.383Z INFO crucible-pantry (dropshot): request completed                                                                          
    latency_us = 47954                                                                                                                    
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    method = POST                                                                                                                         
    remote_addr = [fd00:1122:3344:101::a]:38249                                                                                           
    req_id = a413a283-ac89-4558-b1cf-d50fd4a72ea5                                                                                         
    response_code = 200                                                                                                                   
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d                                                                  
15:24:53.069Z INFO crucible-pantry (dropshot): accepted connection                                                                        
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    remote_addr = [fd00:1122:3344:101::a]:51951                                                                                           
15:24:53.071Z INFO crucible-pantry (dropshot): request completed                                                                          
    latency_us = 1526                                                                                                                     
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    method = POST                                                                                                                         
    remote_addr = [fd00:1122:3344:101::a]:51951                                                                                           
    req_id = 8c73dbac-fe11-4432-b888-7dd051c86771                                                                                         
    response_code = 204                                                                                                                   
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write                                                       
15:24:53.545Z INFO crucible-pantry (dropshot): request completed                                                                          
    latency_us = 1692                                                                                                                     
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    method = POST                                                                                                                         
    remote_addr = [fd00:1122:3344:101::a]:51951                                                                                           
    req_id = 99ae709b-c811-4ce5-9fdb-19fe31bd0b26                                                                                         
    response_code = 204                                                                                                                   
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write                                                       
15:24:53.706Z INFO crucible-pantry (dropshot): request completed                                                                          
    latency_us = 1730                                                                                                                     
    local_addr = [fd00:1122:3344:101::11]:17000                                                                                           
    method = POST                                                                                                                         
    remote_addr = [fd00:1122:3344:101::a]:51951                                                                                           
    req_id = 01820eb1-7fc7-47f7-b338-12dae7ef2635                                                                                         
    response_code = 204                                                                                                                   
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write 

The bulk-write requests continue on an expected cadence, but then, it appears as if the pantry receives a message to disconnect from the downstairs:

15:25:12.018Z INFO crucible-pantry (dropshot): request completed
    latency_us = 1537
    local_addr = [fd00:1122:3344:101::11]:17000
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:51951
    req_id = 6ad44cf3-71cf-41c4-b4b9-d0b88f807659
    response_code = 204
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write
15:25:12.344Z INFO crucible-pantry (dropshot): request completed
    latency_us = 1500
    local_addr = [fd00:1122:3344:101::11]:17000
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:51951
    req_id = 3c7332a6-6bf6-4fe4-8e70-46e5316f43a3
    response_code = 204
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write
15:25:12.463Z INFO crucible-pantry (dropshot): request completed
    latency_us = 1371
    local_addr = [fd00:1122:3344:101::11]:17000
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:51951
    req_id = c26c225a-1195-4d75-8c43-e98323a1f180
    response_code = 204
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write
15:25:12.600Z INFO crucible-pantry (dropshot): request completed
    latency_us = 1759
    local_addr = [fd00:1122:3344:101::11]:17000
    method = POST
    remote_addr = [fd00:1122:3344:101::a]:51951
    req_id = 9070f2c7-968d-4eda-88d6-2cf8c55a33a2
    response_code = 204
    uri = /crucible/pantry/0/volume/edda9e6e-c95d-4488-a73d-ec5bf91a121d/bulk-write
15:25:12.887Z INFO crucible-pantry (dropshot): accepted connection
    local_addr = [fd00:1122:3344:101::11]:17000
    remote_addr = [fd00:1122:3344:101::a]:63656
15:25:12.887Z INFO crucible-pantry (datafile): detach removing entry for volume edda9e6e-c95d-4488-a73d-ec5bf91a121d
15:25:12.887Z INFO crucible-pantry (datafile): detaching volume edda9e6e-c95d-4488-a73d-ec5bf91a121d
15:25:12.887Z INFO crucible-pantry (datafile): Request to deactivate this guest
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): checking for deactivation
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): [0] cannot deactivate, job 1175 in state InProgress
     = downstairs
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): not ready to deactivate client 0
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): [1] cannot deactivate, job 1175 in state InProgress
     = downstairs
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): not ready to deactivate client 1
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): [2] cannot deactivate, job 1175 in state InProgress
     = downstairs
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): not ready to deactivate client 2
    session_id = daff5380-9074-4e67-9d63-2597c356c51f
15:25:12.887Z INFO crucible-pantry (datafile): not ready to deactivate due to state Active
     = downstairs
    client = 0
    session_id = daff5380-9074-4e67-9d63-2597c356c51f

So, now the question is who is requesting the disconnect, and why?

@leftwo
Copy link
Contributor

leftwo commented Oct 21, 2024

In the nexus log during the same window (also at /staff/core/omicron-6771 )

I see the disk created okay.

I see the bulk import start:

15:24:53.068Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): authorize result
    action = Modify
    actor = Some(Actor::SiloUser { silo_user_id: d1ab7560-ceee-482c-8caf-aaeb6f481499, silo_id: b1085af4-1d7d-4af7-a113-21237bc973eb, .. })
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:65319
    req_id = b166c760-4e4c-40a4-956a-c461f965743a
    resource = Disk { parent: Project { parent: Silo { parent: Fleet, key: b1085af4-1d7d-4af7-a113-21237bc973eb, lookup_type: ById(b1085af4-1d7d-4af7-a113-21237bc973eb) }, key: 58985ee6-5289-4e53-89dd-08823170e4d9, lookup_type: ByName("alan") }, key: edda9e6e-c95d-4488-a73d-ec5bf91a121d, lookup_type: ByName("jammysource") }
    result = Ok(())
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:24:53.069Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): bulk write of 524288 bytes to offset 10485760 of disk edda9e6e-c95d-4488-a73d-ec5bf91a121d using pantry endpoint [fd00:1122:3344:101::11]:17000
    file = nexus/src/app/disk.rs:448

I see bulk writes going:

15:24:53.538Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): roles
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:35065
    req_id = 0d0ba77c-5be1-4a7c-8520-8562a21846b8
    roles = RoleSet { roles: {(Silo, b1085af4-1d7d-4af7-a113-21237bc973eb, "admin")} }
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:24:53.542Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): authorize result
    action = Modify
    actor = Some(Actor::SiloUser { silo_user_id: d1ab7560-ceee-482c-8caf-aaeb6f481499, silo_id: b1085af4-1d7d-4af7-a113-21237bc973eb, .. })
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:35065
    req_id = 0d0ba77c-5be1-4a7c-8520-8562a21846b8
    resource = Disk { parent: Project { parent: Silo { parent: Fleet, key: b1085af4-1d7d-4af7-a113-21237bc973eb, lookup_type: ById(b1085af4-1d7d-4af7-a113-21237bc973eb) }, key: 58985ee6-5289-4e53-89dd-08823170e4d9, lookup_type: ByName("alan") }, key: edda9e6e-c95d-4488-a73d-ec5bf91a121d, lookup_type: ByName("jammysource") }
    result = Ok(())
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:24:53.542Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): bulk write of 524288 bytes to offset 119537664 of disk edda9e6e-c95d-4488-a73d-ec5bf91a121d using pantry endpoint [fd00:1122:3344:101::11]:17000
    file = nexus/src/app/disk.rs:448
15:24:53.545Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): request completed
    file = /home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 471595
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:35065
    req_id = 0d0ba77c-5be1-4a7c-8520-8562a21846b8
    response_code = 204
    uri = /v1/disks/jammysource/bulk-write?project=alan

Here are the last two bulk writes that Nexus sends:

15:25:12.461Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): bulk write of 524288 bytes to offset 333447168 of disk edda9e6e-c95d-4488-a73d-ec5bf91a121d using pantry endpoint [fd00:1122:3344:101::11]:17000
    file = nexus/src/app/disk.rs:448
...
15:25:12.598Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): bulk write of 524288 bytes to offset 257425408 of disk edda9e6e-c95d-4488-a73d-ec5bf91a121d using pantry endpoint [fd00:1122:3344:101::11]:17000
    file = nexus/src/app/disk.rs:448

Those look to line up with the pantry logs for the last two writes it receives.

Next, it looks like Nexus has received a request to stopt the bulk writes:

15:25:12.614Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): roles
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    roles = RoleSet { roles: {} }
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan
15:25:12.615Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: d1ab7560-ceee-482c-8caf-aaeb6f481499, silo_id: b1085af4-1d7d-4af7-a113-21237bc973eb, .. })   
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    resource = Database
    result = Ok(())
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan
15:25:12.617Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): roles
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    roles = RoleSet { roles: {} }
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan
15:25:12.617Z DEBG 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: d1ab7560-ceee-482c-8caf-aaeb6f481499, silo_id: b1085af4-1d7d-4af7-a113-21237bc973eb, .. })   
    actor_id = d1ab7560-ceee-482c-8caf-aaeb6f481499
    authenticated = true
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    resource = Database
    result = Ok(())
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan

The remote addr: 192.168.1.199 is the place where I am running the import command.
So, possibly the CLI has requested the stop?

Eventually, we see Nexus send the detach to the pantry:

15:25:12.850Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (ServerContext): sending detach for disk edda9e6e-c95d-4488-a73d-ec5bf91a121d to endpoint http://[fd00:1122:3344:101::11]:17000
    file = nexus/src/app/sagas/common_storage.rs:103
    saga_id = 7db31a2e-75b9-449e-a2a4-85260b36b39b
    saga_name = finalize-disk

This detach lines up with when the pantry is told to detach the disk.

@leftwo
Copy link
Contributor

leftwo commented Oct 21, 2024

My CLI version is:

EVT22200005 # oxide version  
Oxide CLI 0.7.0+20240821.0
Built from commit: b5a932c1cd8a3f6e7143eb6ce27a4dc4e277104c 
Oxide API: 20241009.0

@jgallagher
Copy link
Contributor Author

More observations from your Nexus log file: The CLI starts 8 upload threads. Attempting to identify those 8 threads by their remote address looks like it's possible:

john@jeeves /staff/core/omicron-6771 $ grep /v1/disks/jammysource/bulk-write nexus-while-uploading.log | grep -v '"authorize result"' | grep -v RoleSet | looker | grep remote_addr | sort | uniq -c
  28     remote_addr = 192.168.1.199:35065
   1     remote_addr = 192.168.1.199:36880
  19     remote_addr = 192.168.1.199:47618
  11     remote_addr = 192.168.1.199:48630
  12     remote_addr = 192.168.1.199:53994
  32     remote_addr = 192.168.1.199:54907
   4     remote_addr = 192.168.1.199:55340
  25     remote_addr = 192.168.1.199:56019
  14     remote_addr = 192.168.1.199:65319

Two things jump out there: we have 9 remote addrs, and 2 of them have suspiciously low counts. The 1 from :36880 is the bulk-write-stop request:

15:25:12.661Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): request completed
    file = /home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 68874
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:36880
    req_id = 142f65ee-44f7-4859-9887-15694d6042e6
    response_code = 204
    uri = /v1/disks/jammysource/bulk-write-stop?project=alan

Both the timestamps and the error message on the final request from the CLI thread at :55340 with a count of 4 are both interesting:

15:24:54.184Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): request completed
    file = /home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 473424
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:55340
    req_id = 07799230-5866-4d82-92a1-9f15d1baa95d
    response_code = 204
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:24:54.651Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): request completed
    file = /home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 466150
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:55340
    req_id = baecca36-fd6b-4a5b-b8c5-22aade045000
    response_code = 204
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:24:57.540Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): request completed
    file = /home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 2887139
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:55340
    req_id = 178de514-6ffa-456e-bf69-38b886c7f4dc
    response_code = 204
    uri = /v1/disks/jammysource/bulk-write?project=alan
15:25:13.713Z INFO 86008183-61c4-455e-bc24-1f22ba25a647 (dropshot_external): request completed
    error_message_external = error streaming request body: error reading a body from connection
    error_message_internal = error streaming request body: error reading a body from connection
    file = /home/alan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:938
    latency_us = 12951189
    local_addr = 172.30.2.5:80
    method = POST
    remote_addr = 192.168.1.199:55340
    req_id = f1664977-a59e-4c8c-950f-f08213e1bb8d
    response_code = 400
    uri = /v1/disks/jammysource/bulk-write?project=alan

In particular:

  • The time between the last successful request completed on this thread (15:24:57.540Z) and the bulk-write-stop request (which completed at 15:25:12.661Z) is just over 15 seconds, which would be consistent with a progenitor timeout.
  • I'm not sure what to make of the error reading a body from connection. 6 of the 7 other threads have errors that are more in line with what I'd expect, although 1 of the 7 has the same error:
john@jeeves /staff/core/omicron-6771 $ grep /v1/disks/jammysource/bulk-write nexus-while-uploading.log | grep -v '"authorize result"' | grep -v RoleSet | looker | grep -A 5 'error_message_internal' | egrep 'error_message_internal|remote_addr'
    error_message_internal = cannot import blocks with a bulk write for disk in state ImportReady
    remote_addr = 192.168.1.199:65319
    error_message_internal = cannot import blocks with a bulk write for disk in state ImportReady
    remote_addr = 192.168.1.199:35065
    error_message_internal = cannot import blocks with a bulk write for disk in state Finalizing
    remote_addr = 192.168.1.199:54907
    error_message_internal = cannot import blocks with a bulk write for disk in state Detached
    remote_addr = 192.168.1.199:56019
    error_message_internal = cannot import blocks with a bulk write for disk in state Detached
    remote_addr = 192.168.1.199:47618
    error_message_internal = cannot import blocks with a bulk write for disk in state Detached
    remote_addr = 192.168.1.199:53994
    error_message_internal = error streaming request body: error reading a body from connection
    remote_addr = 192.168.1.199:55340
    error_message_internal = error streaming request body: error reading a body from connection
    remote_addr = 192.168.1.199:48630

@davepacheco
Copy link
Collaborator

After some focused debugging I think we've got a much better understanding of the problem here: it appears that:

  • the softnpu implementation uses libdlpi to process the data path traffic on the system
  • the libdlpi path uses STREAMS
  • this path is only willing to enqueue 4KiB of data in the kernel
  • this is causing lots of packets to be dropped
  • this is causing TCP retransmits + backoffs
  • in some cases, the elapsed time for the HTTP request reaches 15s
  • at that point, we hit the Progenitor client timeout and the client abandons the request and we see this error message

Alan was able to reliably reproduce this issue on a bench Gimlet and we did this investigation there. (@leftwo can you post here how you did that?) In this situation:

  • The CLI is using 8 tokio tasks to independently upload chunks of about 512 KiB using individual HTTP requests.
  • Each of the eight tasks uses its own TCP connection to sequentially upload chunks. It uses that one connection for all of its requests.

Some observations that got us here:

  • For the failed requests, @jgallagher saw some sequences where:
    • Dropshot logged that the client disconnected, and this log entry includes a request processing latency, and it was in the tens of seconds.
    • Immediately after this, we saw log entries showing that the request handling process was starting (e.g., authz going to load roles from the database).
  • @jgallagher added more verbose logging to Dropshot and found that these long time gaps covered the time between when Dropshot started to read the POST body and when it finished reading it.
  • @leftwo used snoop to capture packets from the Nexus side while reproducing the problem. This packet capture is at /staff/core/omicron-6771/john-snoop2.log in the engineering VPN. The packet capture shows:
    • a fair bit of ambient retransmits basically the whole time
    • for the request that ultimately failed, we saw gaps of about 3s, then 10s, then things seemed to pick up again and the upload was proceeding before the client suddenly sent a FIN and the server reported a 400 with a failure to read the body.

We went looking for any kstats that would point to large drop counts:

EVT22200005 # kstat -p | awk '/drop/ && $2 != 0'
ip:0:ipdrop:class       net
ip:0:ipdrop:crtime      64.389271751
ip:0:ipdrop:snaptime    13640.168871095
ipf:0:inbound:dropped:pps ceiling       0
ipf:0:outbound:dropped:pps ceiling      0
xde:0:opte0:in_drop     16
xde:0:opte0:in_drop_layer       16
xde:0:opte0:out_drop    1083
xde:0:opte0:out_drop_handle_pkt 51
xde:0:opte0:out_drop_layer      1032
xde:0:opte1:out_drop    49
xde:0:opte1:out_drop_handle_pkt 49
xde:0:opte2:out_drop    49
xde:0:opte2:out_drop_handle_pkt 49
xde:0:opte3:out_drop    49
xde:0:opte3:out_drop_handle_pkt 49
xde:0:opte4:in_drop     10
xde:0:opte4:in_drop_layer       10
xde:0:opte4:out_drop    49
xde:0:opte4:out_drop_handle_pkt 49
xde:0:opte5:out_drop    49
xde:0:opte5:out_drop_handle_pkt 49

Those don't seem large enough to explain the problem.

It was observed that we'd only seen this in CI and lab hardware, not on production hardware. We suspected the softnpu data path (not with great reason at this point) and started there to look for packet loss.

So we went looking for the softnpu process:

EVT22200005 # zlogin sidecar_softnpu
[Connected to zone 'sidecar_softnpu' pts/17]
Last login: Wed Oct 23 00:14:45 on pts/17
The illumos Project     helios-2.0.22957        October 2024
root@sidecar_softnpu:~# ptre
-bash: ptre: command not found
root@sidecar_softnpu:~# ptree
1838   /sbin/init
1843   /lib/svc/bin/svc.startd
  2348   /usr/lib/saf/ttymon -g -d /dev/console -l console -m ldterm,ttcompat -h -p side
1845   /lib/svc/bin/svc.configd
1887   /lib/inet/ipmgmtd
1893   /lib/inet/netcfgd
2079   /usr/lib/pfexecd
2183   /usr/lib/fm/fmd/fmd
2219   /softnpu/softnpu --uds-path /softnpu /softnpu/softnpu.toml
2295   /usr/sbin/nscd
2346   /usr/lib/utmpd
2371   /usr/sbin/cron
2390   /usr/sbin/syslogd
2430   /usr/lib/inet/in.ndpd
2463   /usr/sbin/sshd
19104  /usr/bin/login -z global -f root
  19105  -bash
    19154  ptree

and we looked for its DTrace probes:

EVT22200005 # dtrace -l | grep -i 2219
...
13935 p4rs_provider2219   asic_program.so _ZN4p4rs5table11key_matches17hb3699df0fa3194ddE match_miss
13936 softnpu_provider2219   asic_program.so _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h21870ee349223480E action
13937 softnpu_provider2219   asic_program.so _ZN12sidecar_lite12router_apply17h146f583e567a01f8E control_apply
13938 softnpu_provider2219   asic_program.so _ZN12sidecar_lite17nat_ingress_apply17hb4e85e891acb5747E control_table_hit
13939 softnpu_provider2219   asic_program.so _ZN12sidecar_lite17nat_ingress_apply17hb4e85e891acb5747E control_table_miss
13940 softnpu_provider2219   asic_program.so _ZN62_$LT$sidecar_lite..main_pipeline$u20$as$u20$p4rs..Pipeline$GT$14process_packet17h106f02a631ac5425E ingress_accepted
13941 softnpu_provider2219   asic_program.so _ZN62_$LT$sidecar_lite..main_pipeline$u20$as$u20$p4rs..Pipeline$GT$14process_packet17h106f02a631ac5425E ingress_dropped
13942 softnpu_provider2219   asic_program.so _ZN62_$LT$sidecar_lite..main_pipeline$u20$as$u20$p4rs..Pipeline$GT$14process_packet17h106f02a631ac5425E parser_accepted
13943 softnpu_provider2219   asic_program.so _ZN62_$LT$sidecar_lite..main_pipeline$u20$as$u20$p4rs..Pipeline$GT$14process_packet17h106f02a631ac5425E parser_dropped
13944 softnpu_provider2219   asic_program.so _ZN12sidecar_lite9parse_udp17h0d58607a2426c304E parser_transition
...

We enabled those probes (coarsely) to look for drops:

EVT22200005 # dtrace -n 'p4rs_provider2219:::,softnpu_provider2219:::{ @[probeprov, probename] = count(); }' -n 'tick-10s{ printa(@); }'
dtrace: description 'p4rs_provider2219:::,softnpu_provider2219:::' matched 10 probes
dtrace: description 'tick-10s' matched 1 probe
...
82  76274                        :tick-10s 
  softnpu_provider2219                                ingress_accepted                                              22291
  softnpu_provider2219                                parser_accepted                                               22291
  softnpu_provider2219                                control_table_miss                                            22561
  softnpu_provider2219                                parser_transition                                             88089
  softnpu_provider2219                                control_table_hit                                             88504
  softnpu_provider2219                                action                                                        99805
  softnpu_provider2219                                control_apply                                                121919
  p4rs_provider2219                                   match_miss                                                   168216

It doesn't look softnpu itself is dropping these packets (at least, it's "drop" probes aren't firing).

At this point, @rmustacc went looking through the kernel STREAMS data path, reverse engineering and exploring using cscope, mdb, and DTrace. We found we were going through dld_str_rx_raw and canputnext was returning 0, which through code inspection would result in dropping a packet. Here's the real smoking gun: seeing canputnext return 0 in dld_str_rx_raw, all on vnic1014:

EVT22200005 # dtrace -n 'fbt::dld_str_rx_raw:entry{ self->t = (dld_str_t *)arg0; }' -n 'fbt::dld_str_rx_raw:return{ self->t = NULL; }' -n 'fbt::canputnext:return/arg1 == 0 && self->t != NULL/{ @[stringof(self->t->ds_dlp->dl_name)] = count(); }' -n 'tick-5s{ printa(@); }'
dtrace: description 'fbt::dld_str_rx_raw:entry' matched 1 probe
dtrace: description 'fbt::dld_str_rx_raw:return' matched 1 probe
dtrace: description 'fbt::canputnext:return' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
CPU     ID                    FUNCTION:NAME
 94  76275                         :tick-5s 

 94  76275                         :tick-5s 

 94  76275                         :tick-5s 
  vnic1014                                                        663

 94  76275                         :tick-5s 
  vnic1014                                                       1669

 94  76275                         :tick-5s 
  vnic1014                                                       2739

 94  76275                         :tick-5s 
  vnic1014                                                       3700

 94  76275                         :tick-5s 
  vnic1014                                                       4566

 94  76275                         :tick-5s 
  vnic1014                                                       5309

 94  76275                         :tick-5s 
  vnic1014                                                       6298

 94  76275                         :tick-5s 
  vnic1014                                                       7127

 94  76275                         :tick-5s 
  vnic1014                                                       8000

 94  76275                         :tick-5s 
  vnic1014                                                       8965

 94  76275                         :tick-5s 
  vnic1014                                                      10000

 94  76275                         :tick-5s 
  vnic1014                                                      10901

 94  76275                         :tick-5s 
  vnic1014                                                      11861

 94  76275                         :tick-5s 
  vnic1014                                                      12642

 94  76275                         :tick-5s 
  vnic1014                                                      13673

 94  76275                         :tick-5s 
  vnic1014                                                      14449

 94  76275                         :tick-5s 
  vnic1014                                                      15343

 94  76275                         :tick-5s 
  vnic1014                                                      16275

That's a lot of drops! What's vnic1014? We found two ways to map it to the vnic names we normally use: first, 1014 is the minor number of the device node:

EVT22200005 # ls -l /dev/net/ | grep -i 1014
crw-rw-rw-   1 root     sys      121, 1014 Oct 23 00:50 sc1_0
EVT22200005 # 

and:

> ::walk vnic_cache | ::print vnic_t vn_mh | ::printf "%s->%s\n" mac_impl_t mi_name mi_clients_list[0].mci_name ! sort
...
vnic1014->sc0_1
...

That's one of the softnpu devices.

We did an experiment to increase the max buffer size. First we found the address of the hiwat field:

EVT22200005 # dtrace -n 'fbt::dld_str_rx_raw:entry{ self->t = (dld_str_t *)arg0; }' -n 'fbt::dld_str_rx_raw:return{ self->t = NULL; }' -n 'fbt::canputnext:return/arg1 == 0 && self->t != NULL/{ @[stringof(self->t->ds_dlp->dl_name)] = quantize(self->t->ds_rq->q_next->q_nfsrv->q_count); printf("0x%p\n", self->t); exit(0); }' -n 'tick-5s{ printa(@); }'
dtrace: description 'fbt::dld_str_rx_raw:entry' matched 1 probe
dtrace: description 'fbt::dld_str_rx_raw:return' matched 1 probe
dtrace: description 'fbt::canputnext:return' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  76275                         :tick-5s 

112  35518                canputnext:return 0xfffffdc61db21398


  vnic1014                                          
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
            8192 |                                         0        

...
EVT22200005 # mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic apix cpc crypto mm random smbios zfs sata ip hook neti sockfs lofs vmm scsi_vhci arp ufs ipcc logindmux nsmb ptm ]
...
> 0xfffffdc61db21398::print -at dld_str_t ds_rq->q_next->q_nfsrv[]
fffffdc6113132a8 struct queue ds_rq->q_next->q_nfsrv = {
...
    fffffdc6113132f8 size_t ds_rq->q_next->q_nfsrv->q_hiwat = 0x1400
...

We decided to set the cap to 16 MiB, figuring if we've got 8 threads trying to send at most 1 MiB each, this should be plenty. Let's try it:

> $W
> fffffdc6113132f8/Z 0x1000000       
0xfffffdc6113132f8:             0x1400                  =       0x1000000

At this point, we tried unsuccessfully several times to reproduce the problem and saw no drops:

 # dtrace -n 'fbt::dld_str_rx_raw:entry{ self->t = (dld_str_t *)arg0; }' -n 'fbt::dld_str_rx_raw:return{ self->t = NULL; }' -n 'fbt::canputnext:return/arg1 == 0 && self->t != NULL/{ @[stringof(self->t->ds_dlp->dl_name)] = count(); }' -n 'tick-5s{ printa(@); }'
dtrace: description 'fbt::dld_str_rx_raw:entry' matched 1 probe
dtrace: description 'fbt::dld_str_rx_raw:return' matched 1 probe
dtrace: description 'fbt::canputnext:return' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
CPU     ID                    FUNCTION:NAME
 34  76275                         :tick-5s 

 34  76275                         :tick-5s 

 34  76275                         :tick-5s 

 34  76275                         :tick-5s 

 34  76275                         :tick-5s 

We also looked for how big the buffer got during this experiment:

EVT22200005 # dtrace -n 'fbt::dld_str_rx_raw:entry{ self->t = (dld_str_t *)arg0; }' -n 'fbt::dld_str_rx_raw:return{ self->t = NULL; }' -n 'fbt::canputnext:return/self->t != NULL/{ @[stringof(self->t->ds_dlp->dl_name)] = quantize(self->t->ds_rq->q_next->q_nfsrv->q_count); }' -n 'tick-5s{ printa(@); }'
dtrace: description 'fbt::dld_str_rx_raw:entry' matched 1 probe
dtrace: description 'fbt::dld_str_rx_raw:return' matched 1 probe
dtrace: description 'fbt::canputnext:return' matched 1 probe
dtrace: description 'tick-5s' matched 1 probe
CPU     ID                    FUNCTION:NAME
 69  76275                         :tick-5s 
...
  simnet1018                                        
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
               1 |                                         0        

  simnet1017                                        
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2        
               1 |                                         0        

  simnet1016                                        
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 398      
               1 |                                         0        

  xde1105                                           
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 408631   
               1 |                                         0        

  simnet1011                                        
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1086441  
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         95       
             128 |@                                        17303    
             256 |                                         2398     
             512 |                                         430      
            1024 |                                         3        
            2048 |                                         0        

  vnic1014                                          
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         80       
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         3        
             128 |                                         4        
             256 |                                         3        
             512 |                                         0        
            1024 |                                         57       
            2048 |                                         42       
            4096 |                                         125      
            8192 |                                         208      
           16384 |                                         664      
           32768 |                                         1034     
           65536 |                                         7019     
          131072 |                                         13093    
          262144 |@                                        53451    
          524288 |@@@@@@                                   335216   
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             1511664  
         2097152 |@@@@                                     227709   
         4194304 |@                                        41265    
         8388608 |                                         0      

We think this pretty well explains all the behavior. Assuming that's true, the proposed fix is to have softnpu use the STREAMS equivalent of a sockopt to increase this buffer size.

I vaguely suspect that the recent change to the socket buffer size (#6690) made this problem more likely for two reasons:

  • with a larger socket buffer, I think we'd expect more data could be in flight at once, leading to larger bursts of packets sent, leading to more of them running into these drops
  • with a larger socket buffer, if the TCP flow control window gets comparably larger, it's less likely that all of the in-flight packets in the window won't hit the drop, meaning that at any given time it's more likely that we'll get stuck in a retransmit backoff. I can't say I feel like I understand this well -- it's more of an intuition than a rigorous explanation.

@leftwo
Copy link
Contributor

leftwo commented Oct 23, 2024

For reproduction and analysis steps (this was in the middle of the debugging session @davepacheco mentions above)

On the bench gimlet sn05 talking from the global zone to Nexus (through the softnpu zone/layer), I would upload an image that was local to the global zone like this:

/alan/bin/oxide disk import --project alan --description "ja" --path /alan/cli/jammy-server-cloudimg-amd64.raw --disk test6

The upload would fail, usually within a minute, but sometimes as long as 3 minutes.

Some initial debugging (recorded in /staff/core/omicron-6771/nexus-while-uploading.log ) indicated that the connection between the cli and the nexus dropshot was happening even before Nexus attempted to contact the pantry, or even do any local database lookups.

@jgallagher added additional logging to the oxide CLI and I added some logging to nexus to determine where the time was spent that resulted in the dropshot timeout occourning. Those logs are in:

/staff/core/omicron-6771/john-nexus-with-dropshot-2.log
/staff/core/omicron-6771/john-cli-with-nexus-2.log

From those logs we were able to determine (some of this was in John's analysis above).

  1. One thread of the 8 that the CLI starts was hitting the timeout, and this resulted in the cli aborting the upload.
  2. The cli abort is passed to nexus, which then closes the disk to future imports (and deletes the partially filled disk).
  3. The remaining cli threads can either see a timeout of their own, or find that the disk is no longer accepting bulk writes and those threads will fail.

For example, the CLI log has:

t=103.205170886s upload thread 6 write complete at offset 401604608 with len 524288
t=103.205242933s upload thread 6 starting bulk write import at offset 405798912 with len 524288

Then, 15 seconds later (with other thread activity during this time):

t=118.216682525s upload thread 6 write FAILED at offset 405798912 with len 524288: Communication Error: error sending request for url (http://192.168.1.22/v1/disks/test3/bulk-write?project=alan)
t=118.216980448s upload driver: dropping channels to upload threads
t=118.21702355s upload driver: returning read_result error sending chunk to thread 6 failed: channel closed
unwind_disk_bulk_write_stop: stopping bulk write import

To further determine where the 15 seconds was going, we gatherd traffic from the nexus zone using snoop and recorded it in the following files (in the same directory).

john-snoop2.log
john-nexus-with-dropshot-snoop2.log
john-cli-with-nexus-snoop2.log

Inspection of the snoop output showed us a large number of retries/retransmissions and what appeared to be dropped packets. This lead us to look deeper into the network stack to determine what could be the cause of this issue.

@iximeow
Copy link
Member

iximeow commented Oct 29, 2024

i ran into this again and after reading the very helpful analysis above i'm trying to figure out what to do with

Assuming that's true, the proposed fix is to have softnpu use the STREAMS equivalent of a sockopt to increase this buffer size.

i spent some time fishing through illumos' usr/src/uts/ to see if i could quickly figure out the STREAMS equivalent of a sockopt and i've landed at the intersection of "strrput looks relevant", "maybe there's an ioctl?" and "i'm not sure exactly what in softnpu would have a queue which might warrant further configuring." which is to say, i'm not sure how to make that suggestion real, i might be able to figure it out with some more digging and asking, but maybe someone here has a more direct pointer?

@davepacheco
Copy link
Collaborator

I think @rmustacc and @rcgoodfellow discussed an approach here but I'm not sure the status of that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

No branches or pull requests

9 participants