Flaky test - The node backend is unreachable at the moment. #2320

KtorZ · 2020-11-16T08:07:43Z

Context

Bug report: https://jira.iohk.io/browse/ADP-647

Seen only once. This could just be us overloading the node by running tests in parallel, especially in this region.

Test Case

STAKE_POOLS_QUIT_02 from src/Test/Integration/Scenario/API/Shelley/StakePools.hs:433
STAKE_POOLS_QUIT_01x

Failure / Counter-example

  1) API Specifications, SHELLEY_STAKE_POOLS, STAKE_POOLS_QUIT_02 - Passphrase must be correct to quit
       uncaught exception: RequestException
       DecodeFailure "{\"code\":\"network_unreachable\",\"message\":\"The node backend is unreachable at the moment. Trying again in a bit might work.\"}"

  src/Test/Integration/Scenario/API/Shelley/StakePools.hs:850:9: 
  2) API Specifications, SHELLEY_STAKE_POOLS, STAKE_POOLS_QUIT_01x - Fee boundary values, STAKE_POOLS_QUIT_01xx - I can quit if I have enough to cover fee
       uncaught exception: RequestException
       DecodeFailure "{\"code\":\"network_unreachable\",\"message\":\"The node backend is unreachable at the moment. Trying again in a bit might work.\"}"

It seems that most recent failures are always preceded by:

  src/Test/Integration/Framework/DSL.hs:1471:16: 
  1) API Specifications, SHELLEY_WALLETS, WALLETS_LIST_01 - Wallets are listed from oldest to newest
       uncaught exception: ErrorCall
       getFromResponse failed to get item
       CallStack (from HasCallStack):
         error, called at src/Test/Integration/Framework/DSL.hs:1471:16 in cardano-wallet-core-integration-2020.12.21-AAYZFneZURcKNEp0gMVmAX:Test.Integration.Framework.DSL

  To rerun use: --match "/API Specifications/SHELLEY_WALLETS/WALLETS_LIST_01 - Wallets are listed from oldest to newest/"

So, it might be the one we ought to investigate.

Resolution

See https://jira.iohk.io/browse/ADP-647


2338: Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending r=Anviking a=Anviking

# Issue Number

None. Addressing CI failures.

# Overview

- [x] Add new `flakyBecauseOf ticketOrReason` helper that calls `pendingWith` unless `RUN_FLAKY_TESTS` is set.
- [x] Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending/flaky.
- [x] Add manual test calling for running flaky tests

# Comments

- Should lower the failure rate by 21% of runs, from 59% to 38%.
- Next candidate for marking pending would be #2224, but with a relatively low failure rate of 3.6%, and being important, I think it would be a bad idea.
- Maybe we should have flaky tests run per default, unless setting `DONT_RUN_FLAKY_TESTS` in CI, to maximise the times we run them locally.

Recent bors failures:
```
succeded: 19, failed: 37 (66%), total: 56
excluding #expected failures

Broken down by tags/issues:
10 times #2292 Flaky test - various DB properties causing timeout | #2292
7 times #2295 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
6 times
3 times #2311 Flaky test - integration test timeout after/related to STAKE_POOLS_LIST_01 | #2311
3 times #2230 Flaky STAKE_POOLS_JOIN_05 - Can join when stake key already exists | #2230
2 times #2224 Flaky STAKE_POOLS_LIST_01 - List stake pools, has non-zero saturation & stake | #2224
1 times #another-integration-timeout |
1 times #2337 STAKE_POOLS_GARBAGE_COLLECTION_01 timed out | #2337
1 times #2320 Flaky test - The node backend is unreachable at the moment. STAKE_POOLS_QUIT_02 | #2320
1 times #2295, #2331 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
1 times #2207 Flaky SHELLEY_MIGRATE_01_big_wallet | #2207
1 times #2118 Property `prop_rebalanceSelection` occasionally fails. | #2118
```

Co-authored-by: Johannes Lund <[email protected]>
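The `flakyBecauseOf` helper described in the overview could look roughly like the sketch below. This is an illustration only, not the code from the PR: `FlakyDecision`, `decideFlaky`, and `flakyBecauseOfWith` are hypothetical names, and the action that marks a test as pending is passed in as a parameter so that the sketch depends on nothing beyond `base`.

```haskell
import System.Environment (lookupEnv)

-- | Outcome of the flaky-test check (hypothetical type).
data FlakyDecision
    = RunTest
    | SkipAsPending String  -- message to report for the skipped test
    deriving (Eq, Show)

-- | Pure decision: run the test when the variable is set (to any value),
-- otherwise skip it and report the ticket or reason.
decideFlaky :: Maybe String -> String -> FlakyDecision
decideFlaky (Just _) _ = RunTest
decideFlaky Nothing ticketOrReason =
    SkipAsPending ("Flaky: " ++ ticketOrReason)

-- | Hypothetical variant of the helper. The first argument is the action
-- that marks the current test as pending.
flakyBecauseOfWith :: (String -> IO ()) -> String -> IO ()
flakyBecauseOfWith markPending ticketOrReason = do
    runFlaky <- lookupEnv "RUN_FLAKY_TESTS"
    case decideFlaky runFlaky ticketOrReason of
        RunTest -> pure ()
        SkipAsPending msg -> markPending msg
```

With hspec, passing `pendingWith` as the first argument and calling `flakyBecauseOfWith pendingWith "#2320"` at the start of a test body would skip that test unless `RUN_FLAKY_TESTS` is set.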

2338: Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending r=jonathanknowles a=Anviking

# Issue Number

None. Addressing CI failures.

# Overview

- [x] Add new `flakyBecauseOf ticketOrReason` helper that calls `pendingWith` unless `RUN_FLAKY_TESTS` is set.
- [x] Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending/flaky.
- [x] Add manual test calling for running flaky tests

# Comments

- Should lower the failure rate by 21% of runs, from 59% to 38%.
- Next candidate for marking pending would be #2224, but with a relatively low failure rate of 3.6%, and being important, I think it would be a bad idea.
- Maybe we should have flaky tests run per default, unless setting `DONT_RUN_FLAKY_TESTS` in CI, to maximise the times we run them locally.

Recent bors failures:
```
succeded: 19, failed: 37 (66%), total: 56
excluding #expected failures

Broken down by tags/issues:
10 times #2292 Flaky test - various DB properties causing timeout | #2292
7 times #2295 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
6 times
3 times #2311 Flaky test - integration test timeout after/related to STAKE_POOLS_LIST_01 | #2311
3 times #2230 Flaky STAKE_POOLS_JOIN_05 - Can join when stake key already exists | #2230
2 times #2224 Flaky STAKE_POOLS_LIST_01 - List stake pools, has non-zero saturation & stake | #2224
1 times #another-integration-timeout |
1 times #2337 STAKE_POOLS_GARBAGE_COLLECTION_01 timed out | #2337
1 times #2320 Flaky test - The node backend is unreachable at the moment. STAKE_POOLS_QUIT_02 | #2320
1 times #2295, #2331 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
1 times #2207 Flaky SHELLEY_MIGRATE_01_big_wallet | #2207
1 times #2118 Property `prop_rebalanceSelection` occasionally fails. | #2118
```

Co-authored-by: Johannes Lund <[email protected]>

2338: Mark TRANS_TTL_{01,02}, STAKE_POOLS_JOIN_05, and STAKE_POOLS_SMASH_01 pending r=Anviking a=Anviking

# Issue Number

None. Addressing CI failures.

# Overview

- [x] Add new `flakyBecauseOf ticketOrReason` helper that calls `pendingWith` unless `RUN_FLAKY_TESTS` is set.
- [x] Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending/flaky.
- [x] Also mark STAKE_POOLS_SMASH_01 pending
- [x] Add manual test calling for running flaky tests

# Comments

- Should lower the failure rate by 21% of runs, from 59% to 38%.
- Next candidate for marking pending would be #2224, but with a relatively low failure rate of 3.6%, and being important, I think it would be a bad idea.
- Maybe we should have flaky tests run per default, unless setting `DONT_RUN_FLAKY_TESTS` in CI, to maximise the times we run them locally.

Recent bors failures:
```
succeded: 19, failed: 37 (66%), total: 56
excluding #expected failures

Broken down by tags/issues:
10 times #2292 Flaky test - various DB properties causing timeout | #2292
7 times #2295 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
6 times
3 times #2311 Flaky test - integration test timeout after/related to STAKE_POOLS_LIST_01 | #2311
3 times #2230 Flaky STAKE_POOLS_JOIN_05 - Can join when stake key already exists | #2230
2 times #2224 Flaky STAKE_POOLS_LIST_01 - List stake pools, has non-zero saturation & stake | #2224
1 times #another-integration-timeout |
1 times #2337 STAKE_POOLS_GARBAGE_COLLECTION_01 timed out | #2337
1 times #2320 Flaky test - The node backend is unreachable at the moment. STAKE_POOLS_QUIT_02 | #2320
1 times #2295, #2331 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
1 times #2207 Flaky SHELLEY_MIGRATE_01_big_wallet | #2207
1 times #2118 Property `prop_rebalanceSelection` occasionally fails. | #2118
```

Co-authored-by: Johannes Lund <[email protected]>

2338: Mark TRANS_TTL_{01,02}, STAKE_POOLS_JOIN_05, and STAKE_POOLS_SMASH_01 pending r=Anviking a=Anviking

# Issue Number

None. Addressing CI failures.

# Overview

- [x] Add new `flakyBecauseOf ticketOrReason` helper that calls `pendingWith` unless `RUN_FLAKY_TESTS` is set.
- [x] Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending/flaky.
- [x] Also mark STAKE_POOLS_SMASH_01 pending
- [x] Add manual test calling for running flaky tests

# Comments

- Should lower the failure rate by 21% of runs, from 59% to 38%.
- Next candidate for marking pending would be #2224, but with a relatively low failure rate of 3.6%, and being important, I think it would be a bad idea.
- Maybe we should have flaky tests run per default, unless setting `DONT_RUN_FLAKY_TESTS` in CI, to maximise the times we run them locally.

Recent bors failures:
```
succeded: 19, failed: 37 (66%), total: 56
excluding #expected failures

Broken down by tags/issues:
10 times #2292 Flaky test - various DB properties causing timeout | #2292
7 times #2295 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
6 times
3 times #2311 Flaky test - integration test timeout after/related to STAKE_POOLS_LIST_01 | #2311
3 times #2230 Flaky STAKE_POOLS_JOIN_05 - Can join when stake key already exists | #2230
2 times #2224 Flaky STAKE_POOLS_LIST_01 - List stake pools, has non-zero saturation & stake | #2224
1 times #another-integration-timeout |
1 times #2337 STAKE_POOLS_GARBAGE_COLLECTION_01 timed out | #2337
1 times #2320 Flaky test - The node backend is unreachable at the moment. STAKE_POOLS_QUIT_02 | #2320
1 times #2295, #2331 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
1 times #2207 Flaky SHELLEY_MIGRATE_01_big_wallet | #2207
1 times #2118 Property `prop_rebalanceSelection` occasionally fails. | #2118
```

2346: get rid of 'OnDanglingChange' option for fee balancing r=KtorZ a=KtorZ

# Issue Number

ADP-568

# Overview

- [ ] I have removed 'OnDanglingChange' option for fee balancing.

# Comments

This option allowed choosing between two modes: SaveMoney and PayAndBalance. The former was used with cardano-node, and the latter with jormungandr. The reason for the difference was a discrepancy in the minimum transaction expected by the two ledgers. On jormungandr, transactions have to be _exactly_ balanced and leave exactly the expected fees required by the network. On the other hand, the fee calculation was only a function of the number of inputs and outputs... therefore much easier to satisfy than on cardano-node.

Now that we've removed jormungandr, this extra indirection / complexity is just harmful. Since the 'PayAndBalance' mode is never used, I've removed the option entirely and made the code assume 'SaveMoney' everywhere it used to choose between both alternatives. This also seemingly removes the 'allowUnbalancedTx' field from the transaction layer, which was directly related to this option.

Co-authored-by: Johannes Lund <[email protected]>
Co-authored-by: KtorZ <[email protected]>


rvl · 2021-01-07T12:29:25Z

@KtorZ That error message is a big red herring.
There is no evidence that the node backend is unreachable.
All it means is that some LocalStateQuery (LSQ) request returned some kind of failure.
We should review and merge #2419 as a matter of urgency, so that we stop seeing this error message, and instead have enough information logged so that we can debug the failure.

KtorZ · 2021-01-07T15:55:56Z

@rvl I agree. The error message isn't really helpful as it is, but I am pretty convinced that it would suffice to retry the stakeDistribution query after re-acquiring a more recent point.
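A minimal sketch of that retry idea, under stated assumptions: `AcquireFailure`, `queryWithReacquire`, and `demo` are hypothetical names, not the wallet's actual LocalStateQuery client API. On each attempt the current tip is fetched again before the query runs, and the attempts are bounded.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef)

-- | Hypothetical stand-in for the reasons a node may refuse a point.
data AcquireFailure = PointTooOld | PointNotOnChain
    deriving (Eq, Show)

-- | Run a query against a freshly fetched point. When it fails, fetch the
-- point again and retry, for at most 'maxAttempts' attempts in total.
queryWithReacquire
    :: Monad m
    => Int                                     -- maximum attempts (>= 1)
    -> m point                                 -- fetch the current tip
    -> (point -> m (Either AcquireFailure a))  -- acquire the point, run the query
    -> m (Either AcquireFailure a)
queryWithReacquire maxAttempts currentTip queryAt = go maxAttempts
  where
    go remaining = do
        point <- currentTip
        result <- queryAt point
        case result of
            Left _ | remaining > 1 -> go (remaining - 1)
            _ -> pure result

-- | Simulation: the tip advances by one slot on every read, and the fake
-- node only accepts points at slot 3 or later.
demo :: Int -> IO (Either AcquireFailure Int)
demo maxAttempts = do
    tip <- newIORef (0 :: Int)
    let currentTip = modifyIORef' tip (+ 1) >> readIORef tip
        queryAt slot
            | slot >= 3 = pure (Right (slot * 100))
            | otherwise = pure (Left PointTooOld)
    queryWithReacquire maxAttempts currentTip queryAt
```

In this simulation `demo 5` succeeds on the third attempt with `Right 300`, while `demo 2` gives up with `Left PointTooOld`.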

rvl · 2021-01-12T06:30:47Z

I opened an issue - ADP-647.

Anviking · 2021-01-12T09:39:47Z

Seems odd that this failure only appeared recently though

Anviking · 2021-01-20T12:27:32Z

Let's keep this open still.

rvl · 2021-01-21T06:02:04Z

This should be closed because the tests fail due to an actual bug, not a flaky test.
Also the error message has been improved, so the title of this issue is no longer relevant.

Anviking · 2021-01-21T08:21:34Z

I think it's useful that new failures end up in the same group as the previous 16 failures, such that people see that this is a common problem, and don't create new tickets for it.

It's still a test failure, even if it is also a bug.

piotr-iohk · 2021-01-21T08:28:32Z

Would this be a relevant bug report? -> https://jira.iohk.io/browse/ADP-647

Anviking · 2021-01-21T08:30:02Z

Yes

rvl · 2021-01-21T09:21:39Z

OK

2449: Re-write LocalStateQuery client logic to eliminate acquire failures r=Anviking a=Anviking

# Issue Number

ADP-647, #2320

# Overview

- [x] Allow `send`-ing a composition of queries against a single acquired point, not just one.
- [x] Makes acquire failures practically impossible
- [x] Re-add multi-era support with reduced boilerplate
- [x] Some polish still needed
- [x] Re-add tracing of query times (less granular than before, but done)

# Comments

Pretty sure this _is_ - eliminating acquire failures - the right direction

But also
- Might introduce a new set of problems

## Failures

I have run tests locally a lot on this branch. One new failure I _occasionally_ see is

```
src/Test/Integration/Framework/DSL.hs:1797:7:
1) API Specifications, SHELLEY_STAKE_POOLS, STAKE_POOLS_JOIN_01rewards - Can join a pool, earn rewards and collect them
expected a successful response but got an error: DecodeFailure "{\"code\":\"created_invalid_transaction\",\"message\":\"That's embarrassing. It looks like I've created an invalid transaction that could not be parsed by the node. Here's an error message that may help with debugging: HardForkApplyTxErrFromEra S (S (S (Z (WrapApplyTxErr {unwrapApplyTxErr = ApplyTxError [LedgerFailure (DelegsFailure (WithdrawalsNotInRewardsDELEGS (fromList [(RewardAcnt {getRwdNetwork = Mainnet, getRwdCred = KeyHashObj (KeyHash \\\"9c0ff007dd21bbf24960bd12ae4009efb8cad076228ef1a54c7b5dbe\\\")},Coin 7010064794)])))]}))))\"}"
While verifying (Status {statusCode = 500, statusMessage = "Internal Server Error"},Left (DecodeFailure "{\"code\":\"created_invalid_transaction\",\"message\":\"That's embarrassing. It looks like I've created an invalid transaction that could not be parsed by the node. Here's an error message that may help with debugging: HardForkApplyTxErrFromEra S (S (S (Z (WrapApplyTxErr {unwrapApplyTxErr = ApplyTxError [LedgerFailure (DelegsFailure (WithdrawalsNotInRewardsDELEGS (fromList [(RewardAcnt {getRwdNetwork = Mainnet, getRwdCred = KeyHashObj (KeyHash \\\"9c0ff007dd21bbf24960bd12ae4009efb8cad076228ef1a54c7b5dbe\\\")},Coin 7010064794)])))]}))))\"}"))

To rerun use: --match "/API Specifications/SHELLEY_STAKE_POOLS/STAKE_POOLS_JOIN_01rewards - Can join a pool, earn rewards and collect them/"
```

Important part is `WithdrawalsNotInRewardsDELEGS`. So seems we are not aware of the rewards having already been spent. Maybe this PR makes the rewards slower to update, somehow, not sure.

Edit: digging through my notes, I _have_ seen this failure on another branch — once. I think this PR makes it more likely to occur, but not be inherently related, then.

Co-authored-by: Johannes Lund <[email protected]>
Co-authored-by: Rodney Lorrimar <[email protected]>
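The idea of `send`-ing a composition of queries against a single acquired point can be illustrated with a small applicative wrapper. This is a sketch of the general technique only, not the design used in #2449: `Query`, `LedgerSnapshot`, `send`, and `slotAndRewards` are hypothetical names, and a plain record stands in for the ledger state at the acquired point.

```haskell
-- | A query is a pure function of the state at one acquired point.
newtype Query snapshot a = Query { runQuery :: snapshot -> a }

instance Functor (Query snapshot) where
    fmap f (Query g) = Query (f . g)

-- | Combining queries applicatively makes every sub-query read the same
-- snapshot, so their answers are mutually consistent.
instance Applicative (Query snapshot) where
    pure x = Query (const x)
    Query f <*> Query g = Query (\s -> f s (g s))

-- | Hypothetical stand-in for the ledger state at the acquired point.
data LedgerSnapshot = LedgerSnapshot
    { snapshotSlot :: Int
    , snapshotRewards :: Integer
    }

-- | Acquire once, then answer the whole composed query from that snapshot.
send :: IO snapshot -> Query snapshot a -> IO a
send acquire query = do
    snapshot <- acquire
    pure (runQuery query snapshot)

-- | Two queries composed into one request.
slotAndRewards :: Query LedgerSnapshot (Int, Integer)
slotAndRewards = (,) <$> Query snapshotSlot <*> Query snapshotRewards
```

Because the point is acquired once per `send`, there is no window between sub-queries in which the acquired point could become stale.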

Anviking · 2021-01-25T11:34:22Z

Should be fixed now.

KtorZ added Bug Test failure labels Nov 16, 2020

KtorZ mentioned this issue Nov 16, 2020

Additional tweaks with regards to static address construction via the API #2318

Merged

Anviking removed the Bug label Nov 19, 2020

KtorZ mentioned this issue Dec 31, 2020

Additional checks for input existence in transactions #2405

Merged

KtorZ changed the title ~~Flaky test - The node backend is unreachable at the moment. STAKE_POOLS_QUIT_02~~ Flaky test - The node backend is unreachable at the moment. Jan 7, 2021

rvl closed this as completed Jan 12, 2021

This was referenced Jan 12, 2021

Make wallet run on Mary #2400

Merged

Fix flaky stake pools integration test #2419

Merged

Tweak bors-stats.sh to support jira links #2442

Merged

Index the wallet's UTxO set by asset type #2431

Merged

Anviking mentioned this issue Jan 20, 2021

Fix some integration tests #2452

Merged

piotr-iohk mentioned this issue Jan 20, 2021

[Duplicate] STAKE_POOLS_JOIN_04 fails sporadically #2458

Closed

Anviking reopened this Jan 20, 2021

Anviking mentioned this issue Jan 20, 2021

Multi-Asset Coin Selection #2450

Merged

Anviking self-assigned this Jan 20, 2021

This was referenced Jan 20, 2021

Re-write LocalStateQuery client logic to eliminate acquire failures #2449

Merged

Use a single-striped connection pool for each database layer instead of a single shared connection #2416

Merged

rvl mentioned this issue Jan 22, 2021

Update cardano-node and libraries to 1.25.0 #2459

Merged

rvl mentioned this issue Jan 25, 2021

Multi-asset API extensions #2447

Merged

7 tasks

Anviking closed this as completed Jan 25, 2021
