Flaky test - The node backend is unreachable at the moment. #2320

Closed
KtorZ opened this issue Nov 16, 2020 · 11 comments

KtorZ commented Nov 16, 2020

Context

Bug report: https://jira.iohk.io/browse/ADP-647

Seen only once so far. It could simply be us overloading the node by running tests in parallel, especially in this area of the test suite.

Test Case

  • STAKE_POOLS_QUIT_02 from src/Test/Integration/Scenario/API/Shelley/StakePools.hs:433
  • STAKE_POOLS_QUIT_01x

Failure / Counter-example

  1) API Specifications, SHELLEY_STAKE_POOLS, STAKE_POOLS_QUIT_02 - Passphrase must be correct to quit
       uncaught exception: RequestException
       DecodeFailure "{\"code\":\"network_unreachable\",\"message\":\"The node backend is unreachable at the moment. Trying again in a bit might work.\"}"
  src/Test/Integration/Scenario/API/Shelley/StakePools.hs:850:9: 
  2) API Specifications, SHELLEY_STAKE_POOLS, STAKE_POOLS_QUIT_01x - Fee boundary values, STAKE_POOLS_QUIT_01xx - I can quit if I have enough to cover fee
       uncaught exception: RequestException
       DecodeFailure "{\"code\":\"network_unreachable\",\"message\":\"The node backend is unreachable at the moment. Trying again in a bit might work.\"}"

It seems that most recent failures are always preceded by:

  src/Test/Integration/Framework/DSL.hs:1471:16: 
  1) API Specifications, SHELLEY_WALLETS, WALLETS_LIST_01 - Wallets are listed from oldest to newest
       uncaught exception: ErrorCall
       getFromResponse failed to get item
       CallStack (from HasCallStack):
         error, called at src/Test/Integration/Framework/DSL.hs:1471:16 in cardano-wallet-core-integration-2020.12.21-AAYZFneZURcKNEp0gMVmAX:Test.Integration.Framework.DSL

  To rerun use: --match "/API Specifications/SHELLEY_WALLETS/WALLETS_LIST_01 - Wallets are listed from oldest to newest/"

So, it might be the one we ought to investigate.

Resolution

See https://jira.iohk.io/browse/ADP-647

Anviking removed the Bug label Nov 19, 2020
iohk-bors bot added a commit that referenced this issue Nov 23, 2020
2338: Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending r=Anviking a=Anviking

# Issue Number

None. Addressing CI failures.

# Overview


- [x] Add new `flakyBecauseOf ticketOrReason` helper that calls `pendingWith` unless `RUN_FLAKY_TESTS` is set.
- [x] Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending/flaky.
- [x] Add manual test calling for running flaky tests
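
A minimal sketch of what such a helper could look like, assuming hspec's `pendingWith` and treating any value of `RUN_FLAKY_TESTS` (even an empty one) as "set". The `shouldRunFlaky` helper, the `"Flaky: "` prefix and the placeholder test body are illustrative assumptions, not necessarily what this PR implements:

```haskell
module Main where

import Data.Maybe (isJust)
import System.Environment (lookupEnv)
import Test.Hspec (Expectation, describe, hspec, it, pendingWith, shouldBe)

-- | Pure decision (hypothetical helper): flaky tests run only when the
-- environment variable is present, whatever its value.
shouldRunFlaky :: Maybe String -> Bool
shouldRunFlaky = isJust

-- | Mark the current test as pending unless RUN_FLAKY_TESTS is set.
-- The argument is a ticket number or a short reason, shown in the report.
flakyBecauseOf :: String -> Expectation
flakyBecauseOf ticketOrReason = do
    runFlaky <- shouldRunFlaky <$> lookupEnv "RUN_FLAKY_TESTS"
    if runFlaky
        then pure ()
        else pendingWith ("Flaky: " <> ticketOrReason)

main :: IO ()
main = hspec $ describe "STAKE_POOLS_JOIN_05" $
    it "Can join when stake key already exists" $ do
        -- Placed first so the rest of the body is skipped when pending.
        flakyBecauseOf "#2230"
        -- Placeholder standing in for the real scenario.
        (1 + 1 :: Int) `shouldBe` 2
```

Run as-is, the test is reported as pending; with `RUN_FLAKY_TESTS=1` in the environment, the body after `flakyBecauseOf` executes normally.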

# Comments

- Should lower the failure rate by 21 percentage points, from 59% to 38% of runs.
- The next candidate for marking pending would be #2224, but since it has a relatively low failure rate of 3.6% and the test is important, I think that would be a bad idea.
- Maybe we should run flaky tests by default, unless `DONT_RUN_FLAKY_TESTS` is set in CI, to maximise the number of times we run them locally.

Recent bors failures:
```
succeded: 19, failed: 37 (66%), total: 56
excluding #expected failures

Broken down by tags/issues:
10 times #2292 Flaky test - various DB properties causing timeout | #2292
7 times #2295 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
6 times
3 times #2311 Flaky test - integration test timeout after/related to STAKE_POOLS_LIST_01 | #2311
3 times #2230 Flaky  STAKE_POOLS_JOIN_05 - Can join when stake key already exists | #2230
2 times #2224 Flaky STAKE_POOLS_LIST_01 - List stake pools, has non-zero saturation & stake | #2224
1 times #another-integration-timeout  |
1 times #2337 STAKE_POOLS_GARBAGE_COLLECTION_01 timed out | #2337
1 times #2320 Flaky test - The node backend is unreachable at the moment. STAKE_POOLS_QUIT_02 | #2320
1 times #2295, #2331 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
1 times #2207 Flaky SHELLEY_MIGRATE_01_big_wallet | #2207
1 times #2118 Property `prop_rebalanceSelection` occasionally fails. | #2118
```




Co-authored-by: Johannes Lund <[email protected]>
iohk-bors bot added a commit that referenced this issue Nov 24, 2020
2338: Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending r=Anviking a=Anviking

Co-authored-by: Johannes Lund <[email protected]>
iohk-bors bot added a commit that referenced this issue Nov 24, 2020
2338: Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending r=jonathanknowles a=Anviking

Co-authored-by: Johannes Lund <[email protected]>
iohk-bors bot added a commit that referenced this issue Nov 25, 2020
2338: Mark TRANS_TTL_{01,02}, STAKE_POOLS_JOIN_05, and STAKE_POOLS_SMASH_01 pending r=Anviking a=Anviking

# Issue Number

None. Addressing CI failures.

# Overview


- [x] Add new `flakyBecauseOf ticketOrReason` helper that calls `pendingWith` unless `RUN_FLAKY_TESTS` is set.
- [x] Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending/flaky.
- [x] Also mark STAKE_POOLS_SMASH_01 pending
- [x] Add manual test calling for running flaky tests

Co-authored-by: Johannes Lund <[email protected]>
iohk-bors bot added a commit that referenced this issue Nov 25, 2020
2338: Mark TRANS_TTL_{01,02}, STAKE_POOLS_JOIN_05, and STAKE_POOLS_SMASH_01 pending r=Anviking a=Anviking


2346: get rid of 'OnDanglingChange' option for fee balancing r=KtorZ a=KtorZ

# Issue Number

ADP-568

# Overview

- [ ] I have removed the 'OnDanglingChange' option for fee balancing.

# Comments

  This option allowed choosing between two modes: SaveMoney and
  PayAndBalance. The former was used with cardano-node, and the latter
  with jormungandr. The difference existed because of a discrepancy in
  the minimum transaction fee expected by the two ledgers. On
  jormungandr, transactions have to be _exactly_ balanced and leave
  exactly the fees required by the network. In return, the fee
  calculation there was only a function of the number of inputs and
  outputs, and therefore much easier to satisfy than on cardano-node.
  Now that we've removed jormungandr, this extra indirection and
  complexity is just harmful. Since the 'PayAndBalance' mode is never
  used, I've removed the option entirely and made the code assume
  'SaveMoney' everywhere it used to choose between the two
  alternatives.

  This also seemingly removes the 'allowUnbalancedTx' field from the
  transaction layer, which was directly related to this option.


Co-authored-by: Johannes Lund <[email protected]>
Co-authored-by: KtorZ <[email protected]>
iohk-bors bot added a commit that referenced this issue Nov 25, 2020
2338: Mark TRANS_TTL_{01,02}, STAKE_POOLS_JOIN_05, and STAKE_POOLS_SMASH_01 pending r=Anviking a=Anviking

# Issue Number

None. Addressing CI failures.

# Overview


- [x] Add new `flakyBecauseOf ticketOrReason` helper that calls `pendingWith` unless `RUN_FLAKY_TESTS` is set.
- [x] Mark TRANS_TTL_{01,02} and STAKE_POOLS_JOIN_05 pending/flaky.
- [x] Also mark STAKE_POOLS_SMASH_01 pending
- [x] Add manual test calling for running flaky tests

# Comments

- Should lower the failure rate by 21% of runs, from 59% to 38%. 
- Next candidate for marking pending would be #2224, but with a relatively low failure rate of 3.6%, and being important, I think it would be a bad idea.
- Maybe we should have flaky tests run by default, unless `DONT_RUN_FLAKY_TESTS` is set in CI, to maximise the number of times we run them locally.

Recent bors failures:
```
succeded: 19, failed: 37 (66%), total: 56
excluding #expected failures

Broken down by tags/issues:
10 times #2292 Flaky test - various DB properties causing timeout | #2292
7 times #2295 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
6 times
3 times #2311 Flaky test - integration test timeout after/related to STAKE_POOLS_LIST_01 | #2311
3 times #2230 Flaky  STAKE_POOLS_JOIN_05 - Can join when stake key already exists | #2230
2 times #2224 Flaky STAKE_POOLS_LIST_01 - List stake pools, has non-zero saturation & stake | #2224
1 times #another-integration-timeout  |
1 times #2337 STAKE_POOLS_GARBAGE_COLLECTION_01 timed out | #2337
1 times #2320 Flaky test - The node backend is unreachable at the moment. STAKE_POOLS_QUIT_02 | #2320
1 times #2295, #2331 Flaky TRANS_TTL_{01,02} - SlotNo 80 > SlotNo 50 | #2295
1 times #2207 Flaky SHELLEY_MIGRATE_01_big_wallet | #2207
1 times #2118 Property `prop_rebalanceSelection` occasionally fails. | #2118
```



Co-authored-by: Johannes Lund <[email protected]>
@KtorZ changed the title from "Flaky test - The node backend is unreachable at the moment. STAKE_POOLS_QUIT_02" to "Flaky test - The node backend is unreachable at the moment." on Jan 7, 2021
@rvl
Contributor

rvl commented Jan 7, 2021

@KtorZ That error message is a big red herring.
There is no evidence that the node backend is unreachable.
All it means is that some LSQ returned some kind of failure.
We should review and merge #2419 as a matter of urgency, so that we stop seeing this error message, and instead have enough information logged so that we can debug the failure.

@KtorZ
Member Author

KtorZ commented Jan 7, 2021

@rvl I agree. The error message isn't really helpful as it is but I am pretty convinced that it would suffice to retry the stakeDistribution query after re-acquiring a more recent point.
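
The retry idea can be sketched roughly as follows. This is only an illustration: `retryWithFreshPoint` and its arguments are hypothetical and do not correspond to the wallet's actual LocalStateQuery client.

```haskell
-- | Run a query against a freshly fetched point, retrying with a newer
-- point on failure. All names here are hypothetical.
retryWithFreshPoint
    :: Monad m
    => Int                        -- ^ Maximum number of attempts (at least 1)
    -> m point                    -- ^ Fetch the most recent point (e.g. the tip)
    -> (point -> m (Either e a))  -- ^ Acquire the point and run the query
    -> m (Either e a)
retryWithFreshPoint attempts getPoint runAt = do
    result <- getPoint >>= runAt
    case result of
        Left _ | attempts > 1 ->
            -- The point may no longer be available; try again with a newer one.
            retryWithFreshPoint (attempts - 1) getPoint runAt
        _ -> pure result
```

The point is re-fetched on every attempt, so a retry never reuses the point that just failed.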

@rvl
Contributor

rvl commented Jan 12, 2021

I opened an issue - ADP-647.

@rvl rvl closed this as completed Jan 12, 2021
@Anviking
Member

Seems odd that this failure only appeared recently though

@Anviking
Member

Let's keep this open still.

@rvl
Contributor

rvl commented Jan 21, 2021

This should be closed because the tests fail due to an actual bug not a flaky test.
Also the error message has been improved, so the title of this issue is no longer relevant.

@Anviking
Member

I think it's useful that new failures end up in the same group as the previous 16 failures, such that people see that this is a common problem, and don't create new tickets for it.

It's still a test failure, even if it is also a bug.

@piotr-iohk
Contributor

Would this be a relevant bug report? -> https://jira.iohk.io/browse/ADP-647

@Anviking
Member

Yes

@rvl
Contributor

rvl commented Jan 21, 2021

OK

iohk-bors bot added a commit that referenced this issue Jan 25, 2021
2449: Re-write LocalStateQuery client logic to eliminate acquire failures r=Anviking a=Anviking

# Issue Number

ADP-647, #2320


# Overview

- [x] Allow `send`-ing a composition of queries against a single acquired point, not just one.
- [x] Makes acquire failures practically impossible
- [x] Re-add multi-era support with reduced boilerplate
- [x] Some polish still needed
- [x] Re-add tracing of query times (less granular than before, but done)
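
To illustrate what "a composition of queries against a single acquired point" could look like, here is a simplified sketch. Every name in it (`Query`, `runQuery`, `NodeQuery`, `answer`, `slotInEpoch`) is hypothetical and chosen for the example; it is not the wallet's actual client code, and the single handler merely stands in for an acquired point.

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE RankNTypes #-}

import Data.Functor.Identity (Identity (..))

-- | A tiny query language. 'Prim' wraps a primitive query of type @q a@;
-- 'Pure' and 'Bind' let several queries be combined into one program.
data Query q a where
    Pure :: a -> Query q a
    Bind :: Query q a -> (a -> Query q b) -> Query q b
    Prim :: q a -> Query q a

instance Functor (Query q) where
    fmap f x = Bind x (Pure . f)

instance Applicative (Query q) where
    pure = Pure
    f <*> x = Bind f (\g -> fmap g x)

instance Monad (Query q) where
    (>>=) = Bind

-- | Interpret a whole composite query with a single handler, which stands
-- in for "one acquired point": every primitive query sees the same state.
runQuery :: Monad m => (forall x. q x -> m x) -> Query q a -> m a
runQuery _ (Pure a) = pure a
runQuery h (Bind x k) = runQuery h x >>= runQuery h . k
runQuery h (Prim q) = h q

-- Hypothetical primitive queries and a fixed "ledger state" answering them.
data NodeQuery a where
    GetTipSlot :: NodeQuery Int
    GetEpochLength :: NodeQuery Int

answer :: NodeQuery x -> Identity x
answer GetTipSlot = Identity 105
answer GetEpochLength = Identity 100

-- Two queries composed into one, to be answered from the same point.
slotInEpoch :: Query NodeQuery Int
slotInEpoch = mod <$> Prim GetTipSlot <*> Prim GetEpochLength
```

Because the composite is interpreted in one pass, both primitive queries are answered from the same state, which is the property the first bullet above is after.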

# Comments

Pretty sure this _is_
- eliminating acquire failures
- the right direction

But also
- Might introduce a new set of problems

## Failures

I have run tests locally a lot on this branch. One new failure I _occasionally_ see is
```
  src/Test/Integration/Framework/DSL.hs:1797:7:
  1) API Specifications, SHELLEY_STAKE_POOLS, STAKE_POOLS_JOIN_01rewards - Can join a pool, earn rewards and collect them
       expected a successful response but got an error: DecodeFailure "{\"code\":\"created_invalid_transaction\",\"message\":\"That's embarrassing. It looks like I've created an invalid transaction that could not be parsed by the node. Here's an error message that may help with debugging: HardForkApplyTxErrFromEra S (S (S (Z (WrapApplyTxErr {unwrapApplyTxErr = ApplyTxError [LedgerFailure (DelegsFailure (WithdrawalsNotInRewardsDELEGS (fromList [(RewardAcnt {getRwdNetwork = Mainnet, getRwdCred = KeyHashObj (KeyHash \\\"9c0ff007dd21bbf24960bd12ae4009efb8cad076228ef1a54c7b5dbe\\\")},Coin 7010064794)])))]}))))\"}"
       While verifying (Status {statusCode = 500, statusMessage = "Internal Server Error"},Left (DecodeFailure "{\"code\":\"created_invalid_transaction\",\"message\":\"That's embarrassing. It looks like I've created an invalid transaction that could not be parsed by the node. Here's an error message that may help with debugging: HardForkApplyTxErrFromEra S (S (S (Z (WrapApplyTxErr {unwrapApplyTxErr = ApplyTxError [LedgerFailure (DelegsFailure (WithdrawalsNotInRewardsDELEGS (fromList [(RewardAcnt {getRwdNetwork = Mainnet, getRwdCred = KeyHashObj (KeyHash \\\"9c0ff007dd21bbf24960bd12ae4009efb8cad076228ef1a54c7b5dbe\\\")},Coin 7010064794)])))]}))))\"}"))

  To rerun use: --match "/API Specifications/SHELLEY_STAKE_POOLS/STAKE_POOLS_JOIN_01rewards - Can join a pool, earn rewards and collect them/"
```
The important part is `WithdrawalsNotInRewardsDELEGS`, so it seems we are not aware of the rewards having already been spent. Maybe this PR makes the rewards slower to update somehow; not sure.

Edit: digging through my notes, I _have_ seen this failure on another branch, once. So I think this PR makes it more likely to occur, but is not inherently related to it.

Co-authored-by: Johannes Lund <[email protected]>
Co-authored-by: Rodney Lorrimar <[email protected]>
@rvl rvl mentioned this issue Jan 25, 2021
7 tasks
@Anviking
Member

Should be fixed now.
