test failed in CI: test_omdb_success_cases #6505

Closed
rcgoodfellow opened this issue Sep 2, 2024 · 19 comments · Fixed by #6881
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@rcgoodfellow
Contributor

This test failed on a CI run on pull request 6475:

https://github.com/oxidecomputer/omicron/pull/6475/checks?check_run_id=29546110600

https://buildomat.eng.oxide.computer/wg/0/details/01J6RJ0W9K2R1TX0DVBZ0RS47V/qhyGpI4O40yzHVoFHWrAhRBFaESiU4fFqaOicq5NLEyLHAz2/01J6RJ164N5KYG7G3SJ5PFFX0H

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J6RJ0W9K2R1TX0DVBZ0RS47V/qhyGpI4O40yzHVoFHWrAhRBFaESiU4fFqaOicq5NLEyLHAz2/01J6RJ164N5KYG7G3SJ5PFFX0H#S5276

Excerpt from the log showing the failure:

        FAIL [  25.576s] omicron-omdb::test_all_output test_omdb_success_cases

--- STDOUT:              omicron-omdb::test_all_output test_omdb_success_cases ---

running 1 test
running commands with args: ["db", "disks", "list"]
running commands with args: ["db", "dns", "show"]
running commands with args: ["db", "dns", "diff", "external", "2"]
running commands with args: ["db", "dns", "names", "external", "2"]
running commands with args: ["db", "instances"]
running commands with args: ["db", "reconfigurator-save", "/var/tmp/omicron_tmp/.tmpVjFflB/reconfigurator-save.out"]
running commands with args: ["db", "sleds"]
running commands with args: ["db", "sleds", "-F", "discretionary"]
running commands with args: ["mgs", "inventory"]
running commands with args: ["nexus", "background-tasks", "doc"]
running commands with args: ["nexus", "background-tasks", "show"]
running commands with args: ["nexus", "background-tasks", "show", "saga_recovery"]
running commands with args: ["nexus", "background-tasks", "show", "blueprint_loader", "blueprint_executor"]
running commands with args: ["nexus", "background-tasks", "show", "dns_internal"]
running commands with args: ["nexus", "background-tasks", "show", "dns_external"]
running commands with args: ["nexus", "background-tasks", "show", "all"]
running commands with args: ["nexus", "sagas", "list"]
running commands with args: ["--destructive", "nexus", "sagas", "demo-create"]
running commands with args: ["nexus", "sagas", "list"]
running commands with args: ["--destructive", "nexus", "background-tasks", "activate", "inventory_collection"]
running commands with args: ["nexus", "blueprints", "list"]
running commands with args: ["nexus", "blueprints", "show", "5103da0a-8625-4be7-b03e-16ff5fde04a9"]
running commands with args: ["nexus", "blueprints", "show", "current-target"]
running commands with args: ["nexus", "blueprints", "diff", "5103da0a-8625-4be7-b03e-16ff5fde04a9", "current-target"]
@@ -55,10 +55,13 @@
 ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
 ---------------------------------------------
 stderr:
 note: using database URL postgresql://root@[::1]:REDACTED_PORT/omicron?sslmode=disable
 note: database schema version matches expected (<redacted database version>)
+thread 'tokio-runtime-worker' panicked at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-bb8-diesel-0.2.1/src/async_traits.rs:97:14:
+called `Result::unwrap()` on an `Err` value: JoinError::Cancelled(Id(36))
+note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
 =============================================
 EXECUTING COMMAND: omdb ["db", "reconfigurator-save", "<TMP_PATH_REDACTED>"]
 termination: Exited(0)
 ---------------------------------------------
 stdout:

test test_omdb_success_cases ... FAILED

failures:

failures:
    test_omdb_success_cases

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 2 filtered out; finished in 25.36s


--- STDERR:              omicron-omdb::test_all_output test_omdb_success_cases ---
log file: /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.0.log
note: configured to log to "/var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.0.log"
DB URL: postgresql://root@[::1]:43788/omicron?sslmode=disable
DB address: [::1]:43788
log file: /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.2.log
note: configured to log to "/var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.2.log"
log file: /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.3.log
note: configured to log to "/var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.3.log"
log file: /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.4.log
note: configured to log to "/var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.4.log"
thread 'test_omdb_success_cases' panicked at dev-tools/omdb/tests/test_all_output.rs:242:5:
assertion failed: string doesn't match the contents of file: "tests/successes.out" see diffset above
                set EXPECTORATE=overwrite if these changes are intentional
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: assert_contents<&str>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/expectorate-1.1.0/src/lib.rs:64:9
   3: {async_fn#0}
             at ./tests/test_all_output.rs:242:5
   4: {async_block#0}
             at ./tests/test_all_output.rs:116:1
   5: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   6: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   8: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
   9: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  10: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  11: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  12: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  14: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  15: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  16: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  17: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  18: tokio::runtime::context::set_scheduler
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  21: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  22: tokio::runtime::context::runtime::enter_runtime
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  23: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  24: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  25: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  26: test_omdb_success_cases
             at ./tests/test_all_output.rs:116:1
  27: test_all_output::test_omdb_success_cases::{{closure}}
             at ./tests/test_all_output.rs:117:70
  28: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/var/tmp/omicron_tmp/.tmpDdfI0d"
If you would like to access the database for debugging, run the following:

# Run the database
cargo xtask db-dev run --no-populate --store-dir "/var/tmp/omicron_tmp/.tmpDdfI0d/data"
# Access the database. Note the port may change if you run multiple databases.
cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
failed to clean up ClickHouse data dir:
- /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.1-clickhouse-Tjceqh: File exists (os error 17)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpgUhVaY
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpU3R6qP
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpoJXOMm
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpoo1igu
@davepacheco
Collaborator

Bummer -- and thanks for filing this.

From the output, it looks to me like the test ran the command omdb db instances and that panicked with:

thread 'tokio-runtime-worker' panicked at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-bb8-diesel-0.2.1/src/async_traits.rs:97:14:
called `Result::unwrap()` on an `Err` value: JoinError::Cancelled(Id(36))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

That's not very much to go on. We don't have more because this was a subprocess -- the test ultimately failed only because the output didn't match what it expected. I haven't totally given up yet but I've put up #6516 so that if we hit this again we'll get more information about a panic from the subprocess.
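For anyone trying to reproduce this locally, here is a minimal sketch of pulling a subprocess panic report out of captured stderr. It is not what #6516 does: `run_with_backtrace` and `panic_report` are hypothetical helper names, and the only assumption about the child is that it prints the standard Rust panic header (`thread '...' panicked at ...`) seen in the excerpt above.

```rust
use std::process::{Command, Output};

/// Hypothetical helper: run a child process with backtraces enabled and
/// capture its stdout/stderr instead of inheriting them.
fn run_with_backtrace(program: &str, args: &[&str]) -> std::io::Result<Output> {
    Command::new(program)
        .args(args)
        .env("RUST_BACKTRACE", "1")
        .output()
}

/// Hypothetical helper: return everything in `stderr` from the first Rust
/// panic header onward, or `None` if no panic header is present.
fn panic_report(stderr: &str) -> Option<String> {
    let lines: Vec<&str> = stderr.lines().collect();
    let start = lines
        .iter()
        .position(|l| l.starts_with("thread '") && l.contains("panicked at"))?;
    Some(lines[start..].join("\n"))
}
```

A caller could pass `String::from_utf8_lossy(&output.stderr)` to `panic_report` and fail loudly on `Some(..)`, even when the child exited with status 0 as it did in this run.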

@davepacheco
Collaborator

The panic message is coming from here:
https://github.com/oxidecomputer/async-bb8-diesel/blob/1850c9d9a9311ff6a60cadee9023e7693eda3304/src/async_traits.rs#L97

But I think that's just propagating a panic that happened in the middle of just about anything that async-bb8-diesel was doing. There are a few unwraps in the omdb db instances command itself:

/// Run `omdb db instances`: list data about customer VMs.
async fn cmd_db_instances(
    opctx: &OpContext,
    datastore: &DataStore,
    fetch_opts: &DbFetchOptions,
    running: bool,
) -> Result<(), anyhow::Error> {
    use db::schema::instance::dsl;
    use db::schema::vmm::dsl as vmm_dsl;
    let limit = fetch_opts.fetch_limit;
    let mut query = dsl::instance.into_boxed();
    if !fetch_opts.include_deleted {
        query = query.filter(dsl::time_deleted.is_null());
    }
    let instances: Vec<InstanceAndActiveVmm> = query
        .left_join(
            vmm_dsl::vmm.on(vmm_dsl::id
                .nullable()
                .eq(dsl::active_propolis_id)
                .and(vmm_dsl::time_deleted.is_null())),
        )
        .limit(i64::from(u32::from(limit)))
        .select((Instance::as_select(), Option::<Vmm>::as_select()))
        .load_async(&*datastore.pool_connection_for_tests().await?)
        .await
        .context("loading instances")?
        .into_iter()
        .map(|i: (Instance, Option<Vmm>)| i.into())
        .collect();
    let ctx = || "listing instances".to_string();
    check_limit(&instances, limit, ctx);
    let mut rows = Vec::new();
    let mut h_to_s: HashMap<SledUuid, String> = HashMap::new();
    for i in instances {
        let host_serial = if i.vmm().is_some() {
            if let std::collections::hash_map::Entry::Vacant(e) =
                h_to_s.entry(i.sled_id().unwrap())
            {
                let (_, my_sled) = LookupPath::new(opctx, datastore)
                    .sled_id(i.sled_id().unwrap().into_untyped_uuid())
                    .fetch()
                    .await
                    .context("failed to look up sled")?;
                let host_serial = my_sled.serial_number().to_string();
                e.insert(host_serial.to_string());
                host_serial.to_string()
            } else {
                h_to_s.get(&i.sled_id().unwrap()).unwrap().to_string()
            }
        } else {
            "-".to_string()
        };
        if running && i.effective_state() != InstanceState::Running {
            continue;
        }
        let cir = CustomerInstanceRow {
            id: i.instance().id().to_string(),
            name: i.instance().name().to_string(),
            state: i.effective_state().to_string(),
            propolis_id: (&i).into(),
            sled_id: (&i).into(),
            host_serial,
        };
        rows.push(cir);
    }
    let table = tabled::Table::new(rows)
        .with(tabled::settings::Style::empty())
        .with(tabled::settings::Padding::new(0, 1, 0, 0))
        .to_string();
    println!("{}", table);
    Ok(())
}

But if we panicked in those, I don't think it would show up in async-bb8-diesel. I'm trying to figure out what would show up there. We're not using transaction_async in this code so I don't see how we could have entered async-bb8-diesel and then called back out to this code. An example might be if the synchronous load panicked, but that's not our code so that would be surprising.

I'm also going to file an async-bb8-diesel bug because it seems like it could propagate more about the panic error in this situation.

@davepacheco
Collaborator

Actually, I'm not sure this is an async-bb8-diesel bug. Looking more closely at the JoinError, it's saying that the underlying task was cancelled, not that it panicked. How did that happen? Looking at the docs:

When you shut down the executor, it will wait indefinitely for all blocking operations to finish. You can use shutdown_timeout to stop waiting for them after a certain timeout. Be aware that this will still not cancel the tasks — they are simply allowed to keep running after the method returns. It is possible for a blocking task to be cancelled if it has not yet started running, but this is not guaranteed.

One way I could imagine this happening is if the program started an async database operation (like load_async) but then panicked before the corresponding tokio task was started. That might trigger teardown of the executor and we might see this second panic. But then shouldn't we see some information about that other panic?

@hawkw
Member

hawkw commented Sep 24, 2024

I hit this one today on PR #6652: https://buildomat.eng.oxide.computer/wg/0/details/01J8JH1GHTE595MAF5YBTA8BS6/3ybl2B1ZPCoj5D0UfV5BgvmfQpYqlpYGkOVlmy3dpaUvVN5L/01J8JH2AVHD7EAR6AT7ZG8WS72#S5595. Wanted to comment here because it looks like this particular flake may result in very different panic messages depending on which OMDB command actually hit the issue, so folks hitting this flake might report new issues for it that are actually duplicates of this.

@gjcolombo
Contributor

@sunshowers
Contributor

Could this be a shutdown task ordering issue? At https://docs.rs/crate/async-bb8-diesel/0.2.1/source/src/async_traits.rs#95, if the spawned task is cancelled due to the runtime shutting down, then there's going to be a panic here.

@sunshowers
Contributor

Yeah, looking at it I'm pretty sure that it is a shutdown ordering issue. The API is generic over arbitrary E so there's sadly no good place to put in "child task got cancelled, runtime shutting down" at the moment. So we'll probably have to make a breaking change to async-bb8-diesel.

@sunshowers
Contributor

Hmm, but as Dave pointed out this should only happen if the underlying task was cancelled. And the spawn_blocking documentation promises that the task will never be cancelled. Wonder if there's a deeper tokio bug here.

@sunshowers
Contributor

Ah, according to tokio-rs/tokio#3805 (comment) what's happening is that the runtime is shutting down before spawn_blocking is called. In that case, a handle is returned but it immediately fails with a cancelled error.
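To make that failure mode concrete, here is a toy model using only the standard library. It is not tokio's implementation: `ToyPool`, its `shutting_down` flag, and `join` are hypothetical stand-ins, with a channel receiver playing the role of the join handle. The point it illustrates is that a spawn call made during shutdown can still hand back a handle, and that handle can only ever resolve to an error.

```rust
use std::sync::mpsc::{self, Receiver, RecvError};

/// Toy stand-in for a runtime's blocking pool (hypothetical, not tokio).
struct ToyPool {
    shutting_down: bool,
}

impl ToyPool {
    /// Always returns a "join handle" (here, a channel receiver). If the
    /// pool is already shutting down, the job is dropped without running,
    /// so the handle can only report an error.
    fn spawn_blocking<T, F>(&self, job: F) -> Receiver<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        if self.shutting_down {
            // The job never runs; dropping the sender "cancels" the handle.
            drop(job);
            drop(tx);
        } else {
            std::thread::spawn(move || {
                let _ = tx.send(job());
            });
        }
        rx
    }
}

/// Waiting on the handle: `Err` means the job was cancelled before it ran.
fn join<T>(handle: Receiver<T>) -> Result<T, RecvError> {
    handle.recv()
}
```

Calling `.unwrap()` on the result of `join` for a pool that is shutting down panics, which mirrors the `Result::unwrap()` on `JoinError::Cancelled` in the log; matching on the `Err` case instead would avoid the panic.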

@sunshowers
Contributor

sunshowers commented Oct 12, 2024

I've put up a tentative PR at oxidecomputer/async-bb8-diesel#77. I don't feel great about it, but I also don't see another way sadly.

This is going to end up infecting Omicron as well -- we'll no longer be dealing with Diesel errors, but instead with this new wrapper error type. (And here again we'd need to be careful to not panic on all errors -- instead, if the error is a shutdown error, silently ignoring it somehow.)

Ugh.
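As a rough illustration of the shape such a wrapper could take (this is a sketch, not the type proposed in oxidecomputer/async-bb8-diesel#77): `RunError`, `Joined`, `flatten_join`, and `log_unless_shutdown` are all hypothetical names, and `Joined` is a stand-in for the outcome of joining a blocking task.

```rust
use std::fmt;

/// Hypothetical wrapper: either the caller's own error `E`, or a marker
/// that the runtime was shutting down and the blocking task never ran.
#[derive(Debug, PartialEq)]
enum RunError<E> {
    User(E),
    RuntimeShutdown,
}

impl<E: fmt::Display> fmt::Display for RunError<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RunError::User(e) => write!(f, "{e}"),
            RunError::RuntimeShutdown => write!(f, "runtime is shutting down"),
        }
    }
}

/// Stand-in for the outcome of joining a blocking task (hypothetical).
enum Joined<T> {
    Finished(T),
    Cancelled,
}

/// Map a cancelled join to `RuntimeShutdown` instead of unwrapping it.
fn flatten_join<T, E>(joined: Joined<Result<T, E>>) -> Result<T, RunError<E>> {
    match joined {
        Joined::Finished(Ok(value)) => Ok(value),
        Joined::Finished(Err(e)) => Err(RunError::User(e)),
        Joined::Cancelled => Err(RunError::RuntimeShutdown),
    }
}

/// Caller-side policy: report real errors, stay quiet about shutdown.
fn log_unless_shutdown<E: fmt::Display>(result: Result<(), RunError<E>>) -> Option<String> {
    match result {
        Ok(()) | Err(RunError::RuntimeShutdown) => None,
        Err(err) => Some(format!("health check failed: {err}")),
    }
}
```

The cost described above is visible here: every caller now matches on `RunError<E>` rather than on `E` directly.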

@smklein
Collaborator

smklein commented Oct 14, 2024

I understand why we think async-bb8-diesel is causing this particular backtrace - the diagnosis of "We are trying to spawn new work while the tokio runtime is shutting down" seems accurate - but this seems like it might be a secondary failure, rather than the primary reason for the test failing.

Framed another way: why are we trying to spawn new work amid a runtime shutdown?

I think that propagating better error information from async-bb8-diesel would be worthwhile, just want to confirm my understanding here that "it's weird omdb db instances is sending new requests to the DB while also shutting down, right?"

@hawkw
Member

hawkw commented Oct 14, 2024

I'm guessing the reason we are wondering about async-bb8-diesel is that, as I understand it, nothing else in the omdb db instances command has spawned any tasks in the background, so when the runtime shuts down because the #[tokio::main] function has exited, there isn't anything else left that might be trying to spawn new tasks?

@sunshowers
Contributor

Is there some kind of background task that might be hitting the DB periodically?

edit: it's a bit hard to be completely sure, but the stack trace does seem to suggest this is happening within a task.

@sunshowers
Contributor

sunshowers commented Oct 15, 2024

The error is coming from here:

conn.ping_async().await.map_err(|e| {
-- this looks like a validity check that qorb is doing, which makes sense. Sounds like we may want to change the qorb API as well.

@smklein
Collaborator

smklein commented Oct 15, 2024

Yeah, this tracks with the timing when qorb was integrated into Omicron (in dd85331, which landed right before this bug was first reported). On the bright side, this doesn't look like a bug that would impact prod; it looks like a shutdown ordering problem in the test.

I'll look at how we're terminating the pool. If we can cleanly terminate the qorb pool when the test exits, that should also help resolve this issue.

@smklein smklein self-assigned this Oct 15, 2024
@davepacheco
Collaborator

Sorry I'm a little confused about our hypothesized sequence of events leading to this. Is it something like: in the child process:

  • it sets up everything, runs the bulk of the command, then drops its qorb connection handle
  • qorb has previously started some other tokio task ...
  • the main task finishes and the executor gets dropped
  • that qorb task gets to running ping_async, which enters async-bb8-diesel, which does spawn_blocking, which returns a handle that's already cancelled, and async-bb8-diesel panics?

(I feel like that's not exactly right but I'm just trying to put together the pieces above)

@smklein
Collaborator

smklein commented Oct 15, 2024

Yeah, it's worth clarifying, there are a lot of moving pieces. This is my hypothesis:

  • This test is using nexus_test, so it spawns a qorb pool within Nexus. This spawns background tasks in the test process.
  • The test launches a bunch of processes, which each invoke omdb. I think these processes are actually unrelated to the failure we're seeing? Namely, the "omdb part" of this doesn't matter, it just matters that we're running a nexus_test.
  • Once the test ends, it drops the qorb pool, along with Nexus and the entire tokio runtime.
  • sometimes, one of the qorb tasks is still running while the tokio runtime is getting dropped. It calls ping_async, which makes an async_bb8_diesel call, and panics. But really, it could be doing any background work, like trying to make a new connection to the DB.

The qorb "termination" code is pretty half-baked right now -- it just calls abort on background tasks in drop, but that only signals them to start cancellation; it doesn't guarantee they have actually stopped by the time drop returns.

My plan is the following:

  1. Add explicit termination code within qorb, rather than relying on drop. This should give us a way to ensure we've stopped all background tasks.
  2. Use that explicit termination in nexus_test. This should stop all background tasks while we're exiting the test, but before the tokio runtime actually starts shutting down.
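A minimal sketch of the difference between the two approaches, using standard-library threads as a stand-in for tokio tasks. This is not qorb's API: `Pool`, `terminate`, and the worker loop are hypothetical, and an atomic flag plays the role of the abort signal.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a pool that owns a background health-check worker.
struct Pool {
    stop: Arc<AtomicBool>,
    worker: Option<JoinHandle<u64>>,
}

impl Pool {
    fn new() -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        let worker = thread::spawn(move || {
            let mut checks = 0u64;
            loop {
                checks += 1; // stand-in for a periodic ping
                if flag.load(Ordering::Acquire) {
                    break;
                }
                thread::sleep(Duration::from_millis(1));
            }
            checks
        });
        Pool { stop, worker: Some(worker) }
    }

    /// Explicit termination: signal the worker, then block until it has
    /// really exited. Returns how many checks the worker performed.
    fn terminate(mut self) -> u64 {
        self.stop.store(true, Ordering::Release);
        self.worker
            .take()
            .expect("worker is present until terminate is called")
            .join()
            .expect("worker thread panicked")
    }
}

impl Drop for Pool {
    /// Drop only requests a stop; it does not wait, so the worker may still
    /// be mid-iteration after the pool is gone.
    fn drop(&mut self) {
        self.stop.store(true, Ordering::Release);
    }
}
```

With `terminate`, the caller knows no background work is in flight before tearing down whatever that work depends on; with `Drop` alone, there is a window in which the worker can still start another operation.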

@smklein
Collaborator

smklein commented Oct 16, 2024

#6881 is my proposed fix, with an attempt to summarize "what I believe is going wrong" in the PR message.

6 participants