You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We traced a test failure in #6502 down to the durable object state not actually getting written to the DB. The data was added to the LRU cache at the right time, but (being a cache) it doesn't call vatstoreSet() right away. If the vat sees a lot of other activity, eventually the data will get evicted from the LRU cache, but at the very least we must ensure it gets written before the vat shuts down (e.g. for upgrade).
When a vat is upgraded, we send the old worker a stopVat(), let it complete, then replace the worker with the new version (and send a startVat to the new one). The stopVat() is supposed to flush everything (in particular, it should call the LRU's flush() function). stopVat() used to call bringOutYourDead(), which does a LRU flush(), but that BOYD call was accidentally commented out in commit 91480de (part of #5342), where we decided that the risk of crashes during upgrade was worse than the storage leak of not entirely deleting the old state data, and we disabled a lot of the deletion code.
The BOYD appeared in the middle of a big block of code with 6-ish other calls. We used a block comment to disable the whole section (so the diff I reviewed didn't display the BOYD call), and we didn't realize that we really needed that BOYD to flush the LRU cache. And we didn't have a test that was sensitive to this case (we have lots of upgrade tests, but either they did their best to disable the LRU cache, for simplicity, or they were using enough durable objects that the LRU cache was being cycled naturally, so we weren't completely depending upon the flush()).
The fix is to re-enable that bringOutYourDead() in the stopVat() process (during releaseOldState() in stop-vat.js).
And of course we should add a new test to catch this case. It should use the default LRU cache size (maybe 100 or so), create a vat with a single durable object, not do other activity, then perform a null upgrade. The second incarnation should attempt to access the old object.
We can mostly copy this test from @gibson042 's PR #6502 , just without the PublisherKit stuff: one vivifyDurable.. of some trivial Kind, then provide an instance.
Describe the bug
We traced a test failure in #6502 down to the durable object state not actually getting written to the DB. The data was added to the LRU cache at the right time, but (being a cache) it doesn't call
vatstoreSet()
right away. If the vat sees a lot of other activity, eventually the data will get evicted from the LRU cache, but at the very least we must ensure it gets written before the vat shuts down (e.g. for upgrade).When a vat is upgraded, we send the old worker a
stopVat()
, let it complete, then replace the worker with the new version (and send astartVat
to the new one). ThestopVat()
is supposed to flush everything (in particular, it should call the LRU'sflush()
function).stopVat()
used to callbringOutYourDead()
, which does a LRUflush()
, but that BOYD call was accidentally commented out in commit 91480de (part of #5342), where we decided that the risk of crashes during upgrade was worse than the storage leak of not entirely deleting the old state data, and we disabled a lot of the deletion code.agoric-sdk/packages/SwingSet/src/liveslots/stop-vat.js
Line 348 in 2ddcdf6
The BOYD appeared in the middle of a big block of code with 6-ish other calls. We used a block comment to disable the whole section (so the diff I reviewed didn't display the BOYD call), and we didn't realize that we really needed that BOYD to flush the LRU cache. And we didn't have a test that was sensitive to this case (we have lots of upgrade tests, but either they did their best to disable the LRU cache, for simplicity, or they were using enough durable objects that the LRU cache was being cycled naturally, so we weren't completely depending upon the
flush()
).The fix is to re-enable that
bringOutYourDead()
in thestopVat()
process (duringreleaseOldState()
instop-vat.js
).And of course we should add a new test to catch this case. It should use the default LRU cache size (maybe 100 or so), create a vat with a single durable object, not do other activity, then perform a null upgrade. The second incarnation should attempt to access the old object.
We can mostly copy this test from @gibson042 's PR #6502 , just without the PublisherKit stuff: one
vivifyDurable..
of some trivial Kind, thenprovide
an instance.cc @FUDCo
The text was updated successfully, but these errors were encountered: