Fix test flakiness #388

spacebear21 · 2024-11-13T22:01:10Z

This change seems to reduce the frequency of the lock contention on TCP sockets in tests but doesn't fix it entirely and I'm not yet sure why. Leaving in Draft status while I debug further.

DanGould · 2024-11-14T01:29:33Z

Have you been able to reproduce the bug in a local environment or are you still relying on CI to produce it?

spacebear21 · 2024-11-14T02:26:27Z

I'm able to reproduce locally but not as frequently as it occurs in CI (possibly just because CI runs 3x as many jobs).

Initialize OHTTP relay and payjoin directory once and reuse those connections across all v2 tests.

spacebear21 · 2024-11-20T22:25:44Z

Suggestion from @nothingmuch : Make the ohttp relay and payjoin directory port parameters optional, and simply find a free port during initialization if one is not provided.
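A minimal sketch of that suggestion, using only the standard library. `bind_service` is a hypothetical helper, not the actual relay or directory API: it binds to the requested port if one is given, otherwise to port 0 so the OS assigns a free one, and returns the listener together with the port that was actually bound.

```rust
use std::io;
use std::net::{Ipv4Addr, SocketAddr, TcpListener};

/// Hypothetical helper (not part of payjoin): bind to `port` if given,
/// otherwise to port 0 so the OS picks a free port as part of the bind.
fn bind_service(port: Option<u16>) -> io::Result<(TcpListener, u16)> {
    let addr = SocketAddr::from((Ipv4Addr::LOCALHOST, port.unwrap_or(0)));
    let listener = TcpListener::bind(addr)?;
    // Read the port back from the bound socket; when port 0 was requested
    // this is the only place the real port number is available.
    let bound = listener.local_addr()?.port();
    Ok((listener, bound))
}
```

Because the listener is returned while still bound, there is no window in which another process can take the port.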

spacebear21 · 2024-11-21T18:58:31Z

Found an issue that explains why sharing infrastructure across tests isn't working: tokio-rs/tokio#2374. The approach of sharing a global connection pool seems generally discouraged by the tokio maintainers...

TL;DR:

#[tokio::test] creates a new runtime for each test, so:

the async connection pool of the application is initialized when the first test accesses it

the async connection pool is then bound to the runtime of the first test

when the first test ends, its runtime is closed and consequently, the pool is also closed

then, all other tests fail because the pool is closed.
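The lifetime problem can be mimicked without tokio. The following is a toy model, not tokio code: `ToyRuntime`, `SharedPool`, and `run_test` are made-up names, and a plain channel stands in for the pool. The first "test" to touch the pool ties it to its own runtime, and every later test finds it closed.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::OnceLock;

/// Toy stand-in for the runtime that `#[tokio::test]` builds for each test.
/// It owns the receiving end of the pool's channel, the way a real runtime
/// owns the I/O resources registered with it.
#[allow(dead_code)]
struct ToyRuntime {
    inbox: Option<Receiver<u32>>,
}

/// Toy stand-in for a lazily initialized pool shared by every test.
type SharedPool = OnceLock<Sender<u32>>;

/// Runs one "test" against the shared pool and reports whether the
/// request was accepted.
fn run_test(pool: &SharedPool, request: u32) -> bool {
    let mut rt = ToyRuntime { inbox: None };
    // Only the first caller runs this closure, so the pool ends up tied
    // to the first test's runtime.
    let sender = pool.get_or_init(|| {
        let (tx, rx) = channel();
        rt.inbox = Some(rx);
        tx
    });
    sender.send(request).is_ok()
    // `rt` is dropped here; if it owned the receiver, the pool is now dead.
}
```

With one `SharedPool`, the first call to `run_test` succeeds and every later call returns `false`, which matches the pattern of one passing test followed by failures.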

spacebear21 · 2024-11-27T01:09:14Z

Summarizing the problem:

Each v2 test initializes an OHTTP relay and a payjoin directory. These processes run on a random "free" port selected by find_free_port. There is a race condition between finding a free port and actually initializing the relay/directory, which occasionally results in an "Address in use" error when the relay/directory process tries to bind to that port.
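To make the race concrete, here is a std-only sketch that replays it deterministically. The body of `find_free_port` below is an assumption about how such a helper typically works (bind to port 0, read the port, release it); the real helper may differ. `replay_race` is a hypothetical name.

```rust
use std::io;
use std::net::TcpListener;

/// Assumed shape of a `find_free_port`-style helper: ask the OS for a free
/// port, then release it before returning the number.
fn find_free_port() -> io::Result<u16> {
    let probe = TcpListener::bind("127.0.0.1:0")?;
    let port = probe.local_addr()?.port();
    Ok(port) // `probe` is dropped here, so anyone can take the port again
}

/// Replays the race: a rival binds the port between the lookup and the
/// service's own bind. Returns the error kind the service's bind hits.
fn replay_race() -> io::Result<Option<io::ErrorKind>> {
    let port = find_free_port()?;
    // Another test or process wins the port first.
    let _rival = TcpListener::bind(("127.0.0.1", port))?;
    // The relay/directory now tries to bind the port it was told was free.
    Ok(TcpListener::bind(("127.0.0.1", port)).err().map(|e| e.kind()))
}
```

The second bind fails with `AddrInUse`, the same error the tests hit when a concurrent test or unrelated process grabs the port in that window.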

I tried the following approaches, none of which fix the issue:

Use a Mutex to keep track of which ports have already been reserved. This doesn't work, afaict because it doesn't account for ports used by other processes (e.g. the redis and http clients).
Spin up the test infrastructure (ohttp relay + directory) once and share the connections across tests. This doesn't work because tokio tests don't share a runtime (see my earlier comment on this PR about tokio-rs/tokio#2374).
Don't specify a port and let the directory/ohttp relay select a free port on initialization (as suggested by @nothingmuch ). The issue with this is that the tests need to know the relay & directory ports so that they know where to direct requests.
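On the third point, one way a test could learn the port is for the service to report it right after binding. A minimal std-only sketch, assuming a hypothetical `spawn_service` that stands in for relay/directory startup (the real services are async and their API is not shown here):

```rust
use std::io::{self, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the relay/directory: binds port 0, reports the
/// assigned port, then serves exactly one connection.
fn spawn_service() -> io::Result<(u16, JoinHandle<io::Result<()>>)> {
    let (port_tx, port_rx) = mpsc::channel();
    let handle = thread::spawn(move || -> io::Result<()> {
        let listener = TcpListener::bind("127.0.0.1:0")?;
        // Report the port only after the socket is bound and listening.
        port_tx.send(listener.local_addr()?.port()).ok();
        let (mut conn, _) = listener.accept()?;
        conn.write_all(b"ready")
    });
    // Blocks until the service has bound; errors if it died before reporting.
    let port = port_rx
        .recv()
        .map_err(|_| io::Error::new(io::ErrorKind::Other, "service failed to start"))?;
    Ok((port, handle))
}

/// What a test would do: learn the port from the service, then connect to it.
fn query_service() -> io::Result<String> {
    let (port, handle) = spawn_service()?;
    let mut reply = String::new();
    TcpStream::connect(("127.0.0.1", port))?.read_to_string(&mut reply)?;
    handle.join().expect("service thread panicked")?;
    Ok(reply)
}
```

Since the port is sent only once the listener exists, the test never has to guess a port or wait for one to become bound.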

Other things I considered:

Use named Unix sockets instead of TCP, but this seems quite involved...
Use testcontainers to containerize the relay and directory processes, but this would necessitate making custom docker images and I'm not even sure it would work.

I've spent what feels like too many hours on this and am truly stumped... Happy to take another stab if anyone has new suggestions, but until then I'm putting this on the back burner.

DanGould · 2024-11-27T03:55:44Z

It can be painful to be stuck on something for a long time. Clearly what you've done so far is valuable and you've put lots of thought into the approach and potential remedies.

I was mostly wondering why @nothingmuch's suggestion didn't work, so I took a stab at it myself (to no avail):

Don't specify a port and let the directory/ohttp relay select a free port on initialization (#388 (comment) by @nothingmuch ). The issue with this is that the tests need to know the relay & directory ports so that they know where to direct requests.

And the problem with this is that it's easy to get a deadlock when passing both a handle and a port up from payjoin-directory. I'm sure there's a way to make this work, but my screwing around with it did not readily turn one up. The GET request in the wait_for_service_ready loop never seemed to be able to send(). Easy to understand how you get into a black hole with this one.
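For experimenting with this, a readiness check with a hard deadline at least turns a hang into an error. This is a TCP-level sketch only: the real wait_for_service_ready issues a GET request, and `wait_until_listening` is a hypothetical name that does not address the deadlock itself.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical TCP-level readiness check: retry connecting until it
/// succeeds or the deadline passes, then return the last error.
fn wait_until_listening(addr: SocketAddr, timeout: Duration) -> io::Result<()> {
    let deadline = Instant::now() + timeout;
    loop {
        match TcpStream::connect_timeout(&addr, Duration::from_millis(100)) {
            Ok(_) => return Ok(()),
            // Out of time: surface the failure instead of looping forever.
            Err(e) if Instant::now() >= deadline => return Err(e),
            // Not up yet: back off briefly and try again.
            Err(_) => thread::sleep(Duration::from_millis(20)),
        }
    }
}
```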

In the meantime, might you share any scripts you used to test the tests? So we can pick this up at a time of rest?

@0xBEEFCAF3 just shared today how stoked he was on the clean cut-through interface. That took a long while to get right. This isn't the first time we've run into serious snags and I know we'll get through it.

spacebear21 · 2024-11-27T04:59:43Z

In the meantime, might you share any scripts you used to test the tests?

It pretty much amounts to this bad boy:

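# Re-run the given command until it fails, counting the runs that passed.
untilfail ()
{
  count=0
  while "$@"; do
    (( count++ ))
    echo "###### RUN COUNT: $count ######"
  done && say "failed after $count runs"  # `say` is the macOS text-to-speech command
}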

I run it like this and wait until the error occurs:

untilfail cargo test --test integration --features=v2,danger-local-https -- --nocapture

Note: on my old machine this fails reliably within ~40 runs - on my new/much faster machine it ~never fails.

debug long running errors

974cdca

spacebear21 added 2 commits November 15, 2024 14:53

Share infrastructure between tests

efb80d8

Initialize OHTTP relay and payjoin directory once and reuse those connections across all v2 tests.

debugging

2a99b06

spacebear21 force-pushed the fix-test-flakiness branch from 3d1b41b to 2a99b06 Compare November 15, 2024 19:54

spacebear21 linked an issue Nov 20, 2024 that may be closed by this pull request

Fix flaky integration tests: "Ohttp relay is long running" #380

Open

spacebear21 mentioned this pull request Nov 27, 2024

Fix flaky integration tests: "Ohttp relay is long running" #380

Open
