Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(notifier): subscribeEach/subscribeLatest iterators retry when broken by vat upgrade #7401

Merged
merged 11 commits into from
Apr 25, 2023

Conversation

gibson042
Copy link
Member

@gibson042 gibson042 commented Apr 12, 2023

Fixes #5185

Description

"latest" iterators look for rejections that indicate failure due to vat upgrade, and upon encountering them retry the getUpdateSince call—if the source is durable, the retry will succeed and provide the next iteration result. If the source is not durable, the retry will go splat and the error associated with that will be propagated in place of the vatUpgraded rejection (which is perhaps unfortunate, but seems to be the only way of guaranteeing that a real fail result with unfortunate timing is not replaced by a vatUpgraded rejection).

"each" iterators behave similarly, but additionally fail if multiple values are published before they can successfully reconnect because they cannot tolerate gaps in the sequence.

Security Considerations

A malicious source can fake vatUpgraded rejections, which the "latest" iterators will dutifully accept and respond to by retrying as long as incarnationNumber indicates forward progress. Something like this is necessary to handle the edge case where a second upgrade occurs in between the consumer receiving a rejection and the producer receiving the retry, but we could put in a retry count limit if this is perceived as an actual risk.

Scaling Considerations

I don't think this affects scaling.

Documentation Considerations

These relatively fine implementation details are covered by JSDoc comments.

Testing Considerations

The rapid-upgrade edge case is not covered by testing, but I consider that acceptable. If reviewers disagree, I think I can add a new file with manually-constructed mock producers.

@mhofman
Copy link
Member

mhofman commented Apr 12, 2023

subscribeEach iterators continue to fail in that scenario, because they cannot guarantee absence of gaps.

That makes sense. But I'm somewhat worried about what that means for these consumers, especially since it seems they may be using subscibeEach through layers of utils without knowing. That said I don't see the alternative.

seems to be the only way of guaranteeing that a real fail result with unfortunate timing is not replaced by a vatUpgraded rejection).

Could we annotate the rejection with the upgraded rejection?

@gibson042 gibson042 force-pushed the gibson-5185-reconnect-subscribers branch 3 times, most recently from 548ed4c to 36ee17b Compare April 13, 2023 16:40
@gibson042 gibson042 force-pushed the gibson-5185-reconnect-subscribers branch 2 times, most recently from aad76d2 to 0e926a7 Compare April 16, 2023 18:00
@gibson042 gibson042 changed the title feat(notifier): subscribeLatest iterators retry when broken by vat upgrade feat(notifier): subscribeEach/subscribeLatest iterators retry when broken by vat upgrade Apr 16, 2023
@gibson042
Copy link
Member Author

Updated to opportunistically reconnect subscribeEach iterators when possible per #5185 (comment) .

@gibson042
Copy link
Member Author

gibson042 commented Apr 16, 2023

This PR introduces a locally-reproducible failure for the "Watch interest accrue" test in test-stakeFactory.js that might be real—d.waitDays(90) is followed by d.checkRUNDebt(195n), which compares E(vault).getCurrentDebt() against 195e6 and comes up just outside the acceptable tolerance of 0.2e6:

    name: AssertionError
    assertion: deepEqual
    values:
      'Difference (- actual, + expected):': |2-
          {
            brand: Object @Alleged: IST brand {},
        -   value: 194796320n,
        +   value: 195000000n,
          }
    at: >-
      approxEqual
      (packages/inter-protocol/test/stakeFactory/test-stakeFactory.js:507:7)

      Object.checkRUNDebt
      (packages/inter-protocol/test/stakeFactory/test-stakeFactory.js:742:7)

      async packages/inter-protocol/test/stakeFactory/test-stakeFactory.js:991:3

I'm not sure how E(vault).getCurrentDebt() depends upon subscribeEach or subscribeLatest, but I can confirm that the reported value in that test is affected by the count of awaits they entail (e.g., removing the await null in reconnectAsNeeded would reduce that 194.796320e6 to 194.730620e6). The situation reads to me like sensitivity to artificial testing details and can be "fixed" by increasing the value of epsilon from micro.unit / 5n to e.g. micro.unit / 4n, but I'd like confirmation from @dckc and/or @turadg that such a change would maintain the intent of the stakeFactory tests (which were added in entirety by #4741), and also if possible a suggested explanation of what epsilon values are appropriate in case further changes are warranted in the future.

@turadg
Copy link
Member

turadg commented Apr 18, 2023

@gibson042 thanks for the detailed question. @dckc and I conferred and agree with your conclusion. For appropriate epsilon value, go with whatever works because StakeFactory is not scheduled for any release.

Copy link
Member

@michaelfig michaelfig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just a few comments that shouldn't block approval.

});
harden(makeUpgradeDisconnection);

// TODO: Simplify once we have @endo/patterns (or just export the shape).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have @endo/patterns now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first attempt didn't quite work, so I'm deferring this until we've got a little more time to spend on it.

Comment on lines 120 to 122
// rejections here too to avoid invalid unhandled rejection issues later.
void E.when(publishCountP, sink, sink);
void E.when(nextCellP, sink, sink);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uncertain about why publishCountP is sunk, but tailP (originally pubList at this point) is not. Can you elaborate (either by replying here, or in a code comment) as to why these two calls are necessary?

I suspect that there may be a clearer way to avoid introducing any unhandled rejections with minimal sinks. I just don't see it right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any rejection of tailP would be handled in reconnectAsNeeded.

Comment on lines +148 to +149
const firstCellP = reconnectAsNeeded(() => E(topic).subscribeAfter());
return makeEachIterator(topic, firstCellP);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really clear!

@gibson042 gibson042 force-pushed the gibson-5185-reconnect-subscribers branch from 2ec199f to e6d65d0 Compare April 25, 2023 17:08
@gibson042 gibson042 added the automerge:no-update (expert!) Automatically merge without updates label Apr 25, 2023
@mergify mergify bot merged commit bf2a9ff into master Apr 25, 2023
@mergify mergify bot deleted the gibson-5185-reconnect-subscribers branch April 25, 2023 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:no-update (expert!) Automatically merge without updates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Notifier observers to discriminate between vat failure and upgrade
4 participants