networking background tasks emit lots of warnings in simulated environments (including test suite) #6076

Open
davepacheco opened this issue Jul 12, 2024 · 0 comments

It's easiest to see this in omicron-dev run-all:

$ cargo run --bin=omicron-dev -- run-all
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.50s
     Running `target/debug/omicron-dev run-all`
omicron-dev: setting up all services ... 
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.9781.0.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.9781.0.log"
DB URL: postgresql://root@[::1]:43799/omicron?sslmode=disable
DB address: [::1]:43799
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.9781.2.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.9781.2.log"
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.9781.3.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.9781.3.log"
omicron-dev: services are running.
omicron-dev: nexus external API:    127.0.0.1:12220
omicron-dev: nexus internal API:    [::1]:12221
omicron-dev: cockroachdb pid:       9862
omicron-dev: cockroachdb URL:       postgresql://root@[::1]:43799/omicron?sslmode=disable
omicron-dev: cockroachdb directory: /dangerzone/omicron_tmp/.tmphbrhfC
omicron-dev: internal DNS HTTP:     http://[::1]:38018
omicron-dev: internal DNS:          [::1]:36531
omicron-dev: external DNS name:     oxide-dev.test
omicron-dev: external DNS HTTP:     http://[::1]:60088
omicron-dev: external DNS:          [::1]:50282
omicron-dev:   e.g. `dig @::1 -p 50282 test-suite-silo.sys.oxide-dev.test`
omicron-dev: management gateway:    http://[::1]:57986 (switch0)
omicron-dev: management gateway:    http://[::1]:58720 (switch1)
omicron-dev: silo name:             test-suite-silo
omicron-dev: privileged user name:  test-privileged

The log file shows Nexus emitting lots of warnings. It's easiest to see them by filtering for warning-level messages:

23:06:56.891Z WARN e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c (ServerContext): failed to identify switch slot for dendrite, will retry in 2 seconds
    background_task = bfd_manager
    reason = Communication Error: error sending request for url (http://[::1]:12225/local/switch-id): error trying to connect: tcp connect error: Connection refused (os error 146)
    zone_address = ::1
23:06:58.007Z WARN e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c (ServerContext): failed to identify switch slot for dendrite, will retry in 2 seconds
    background_task = nat_v4_garbage_collector
    reason = Communication Error: error sending request for url (http://[::1]:12225/local/switch-id): error trying to connect: tcp connect error: Connection refused (os error 146)
    zone_address = ::1
23:06:59.285Z WARN e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c (ServerContext): failed to identify switch slot for dendrite, will retry in 2 seconds
    background_task = switch_port_config_manager
    rack_id = c19a698f-c6f9-4a17-ae30-20d711b8f7dc
    reason = Communication Error: error sending request for url (http://[::1]:12225/local/switch-id): error trying to connect: tcp connect error: Connection refused (os error 146)
    zone_address = ::1

This also happens if you run Nexus by hand and I expect it happens in the test suite, too.

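These warnings come from at least three different background tasks (bfd_manager, nat_v4_garbage_collector, and switch_port_config_manager in the excerpt above).

In each case the message is emitted by map_switch_zone_addrs():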

omicron/nexus/src/app/mod.rs

Lines 1036 to 1041 in e4bcfee

warn!(
    log,
    "failed to identify switch slot for dendrite, will retry in 2 seconds";
    "zone_address" => #?addr,
    "reason" => #?e
);
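
To illustrate how a retry loop like this multiplies log noise when several background tasks hit it, here is a minimal sketch. It is not the actual Nexus code: `lookup_switch_slot`, `resolve_with_retry`, the `reachable` flag, and `max_attempts` are hypothetical names, and the loop is bounded only so that the example terminates.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the switch-slot lookup. It fails when nothing
/// is listening at `addr`, as in a simulated environment.
fn lookup_switch_slot(addr: &str, reachable: bool) -> Result<String, String> {
    if reachable {
        Ok(format!("switch0@{addr}"))
    } else {
        Err(format!("connection refused: {addr}"))
    }
}

/// Retry the lookup, recording one warning per failed attempt.
/// `max_attempts` bounds the loop so that this sketch terminates.
fn resolve_with_retry(
    task: &str,
    addr: &str,
    reachable: bool,
    max_attempts: usize,
    delay: Duration,
    warnings: &mut Vec<String>,
) -> Option<String> {
    for _ in 0..max_attempts {
        match lookup_switch_slot(addr, reachable) {
            Ok(slot) => return Some(slot),
            Err(e) => {
                // Every failed attempt produces another warning.
                warnings.push(format!("background_task={task} reason={e}"));
                sleep(delay);
            }
        }
    }
    None
}

fn main() {
    let mut warnings = Vec::new();
    // Each task runs its own retry loop, so the warnings add up across tasks.
    for task in ["bfd_manager", "nat_v4_garbage_collector", "switch_port_config_manager"] {
        resolve_with_retry(task, "[::1]:12225", false, 2, Duration::from_millis(1), &mut warnings);
    }
    println!("{} warnings", warnings.len());
}
```

In this sketch the number of warnings grows with the number of tasks times the number of attempts, which matches the pattern in the log excerpt: the same message repeated with a different background_task value each time.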

I noticed that if you're running Nexus by hand, you run into the same warning and it blocks Nexus startup. The workaround seems to be to set mgd in the Nexus config file to point directly at the mgd instances. That's what the test suite does:

self.config.pkg.mgd.insert(switch_location, config);

and it works because it sets up clients directly:

for (location, config) in &config.pkg.mgd {
    let mg_client = mg_admin_client::Client::new(
        &format!("http://{}", config.address),
        log.clone(),
    );
    mg_clients.insert(*location, Arc::new(mg_client));
}

and bypasses the loop that emits this warning:

if config.pkg.mgd.is_empty() {

This might be a dup of #5201? I was confused that setting these values in the config fixed one part of Nexus (the startup path) but not the other (the background tasks). I guess the difference may be that the startup path was getting stuck on mgd, while the background tasks are getting stuck on dendrite, and I only overrode mgd in my config?
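
A minimal sketch of that guess, assuming the two services are overridden independently. `PkgConfig`, its `dendrite` field, `AddressSource`, `address_source`, and the address `[::1]:4000` are hypothetical and are not taken from the Nexus source. The sketch only shows how overriding one map would leave the other on the discovery path that retries and warns.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of per-switch overrides in the config.
/// Keys are switch locations, values are service addresses.
#[derive(Default)]
struct PkgConfig {
    mgd: BTreeMap<String, String>,
    dendrite: BTreeMap<String, String>,
}

#[derive(Debug, PartialEq)]
enum AddressSource {
    /// Addresses were taken straight from the config file.
    Config(Vec<String>),
    /// Nothing was configured, so the caller falls back to discovery
    /// (the path that retries and logs warnings).
    Discovery,
}

/// Decide where a service's addresses come from, given its override map.
fn address_source(overrides: &BTreeMap<String, String>) -> AddressSource {
    if overrides.is_empty() {
        AddressSource::Discovery
    } else {
        AddressSource::Config(overrides.values().cloned().collect())
    }
}

fn main() {
    let mut cfg = PkgConfig::default();
    // Only mgd is overridden; the address is a placeholder.
    cfg.mgd.insert("switch0".to_string(), "[::1]:4000".to_string());

    println!("mgd: {:?}", address_source(&cfg.mgd));
    println!("dendrite: {:?}", address_source(&cfg.dendrite));
}
```

If the real code behaves like this, the startup path (which uses the mgd map) would stop blocking once mgd is set, while anything that resolves dendrite addresses would keep retrying until a matching override exists.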
