@mhofman and I were investigating #5640 by looking at the kernel stats. His 20-hour loadgen run made a copy of the kernel DB about once every two hours, and the self-reported kernel object counts were rising dramatically:
node ➜ ~/agoric-sdk/packages/SwingSet $ for i in 0 1 2 3 4 5 6 7 8 9; do node misc-tools/db-get.js --raw ../../loadgen-state/stage-$i-state local.kernelStats|jq .kernelObjects; done
511
11140
21765
32410
42982
53372
63867
74329
84766
95374
However, the corresponding number of c-list entries was not:
node ➜ ~/agoric-sdk/packages/SwingSet $ for i in 0 1 2 3 4 5 6 7 8 9; do node misc-tools/db-get.js --raw ../../loadgen-state/stage-$i-state local.kernelStats|jq .clistEntries; done
851
884
872
893
881
885
891
893
929
917
So, what's holding onto the 95k kernel objects? Let's start by picking an ID at random. Each kernel object table entry has two DB keys: one for the owner (a vatID), and a second for the refcount.
We want to pick an entry from the middle ages of this chain: objects allocated very near the start of the chain are likely to be legitimately retained (they're probably fundamental things like Zoe or a contract installation handle), and objects allocated near the end are likely to be retained by recent loadgen operations that are not yet complete. Sort of a bathtub curve for expected object lifetime.
So let's see how many DB entries we have:
Hang on.. the kernel stats said there were 95k objects, which should have 2x = 190k DB keys, and yet we're only seeing 649 keys? Something is lying to us.
Welp, it looks like we forgot to ever decrement the `kernelObjects` counter (or more likely, it got lost in the course of some refactoring). So in fact we only have (649-1)/2 = 324 kernel objects, which sounds pretty flat to me.
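To make the key-count arithmetic concrete, here is a minimal sketch that derives an object count from a list of DB keys. The key names (`koNN.owner`, `koNN.refCount`, and a single `ko.nextID` allocator key that would account for the "-1") are assumptions for illustration, not confirmed against the real kernel DB schema, and `countKernelObjects` is a hypothetical helper.

```javascript
// Hypothetical key layout (an assumption): each kernel object contributes
// `${kref}.owner` and `${kref}.refCount`; one extra key holds the allocator state.
function countKernelObjects(keys) {
  const krefs = new Set();
  for (const key of keys) {
    const match = /^(ko\d+)\.(owner|refCount)$/.exec(key);
    if (match) {
      krefs.add(match[1]); // count each kref once, however many keys it has
    }
  }
  return krefs.size;
}

// Made-up sample data: two objects plus the one non-object key.
const sampleKeys = [
  'ko.nextID',
  'ko20.owner',
  'ko20.refCount',
  'ko21.owner',
  'ko21.refCount',
];

const objectsFromKeys = countKernelObjects(sampleKeys);
const objectsFromArithmetic = (sampleKeys.length - 1) / 2;

// The same arithmetic applied to the 649 keys observed above.
const observedObjects = (649 - 1) / 2;
```

Counting distinct krefs instead of dividing by two would also flag an object that is missing one of its two keys, which the plain arithmetic cannot do.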
So the task is to insert a `decStat('kernelObjects')` into `kernelKeeper.js > deleteKernelObject()`, and of course write a test that makes sure it doesn't get dropped again. And then we can run a new loadgen run and see if the telemetry-reported object count remains as stable as this makes us expect (and then maybe close #5640).
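A rough sketch of the shape of the fix and of the regression check, using a stand-in keeper rather than the real `kernelKeeper.js`. Everything here except the `decStat('kernelObjects')` call and the `deleteKernelObject` name is hypothetical (`makeKeeperSketch`, `addKernelObject`, the key names, the refcount value), so treat it as an illustration of the invariant, not the actual patch.

```javascript
// Stand-in for a kernel keeper, NOT the real kernelKeeper.js.
function makeKeeperSketch() {
  const store = new Map(); // plays the role of the kernel DB
  const stats = { kernelObjects: 0 };
  let nextID = 1;

  const incStat = name => {
    stats[name] += 1;
  };
  const decStat = name => {
    stats[name] -= 1;
  };

  function addKernelObject(ownerVatID) {
    const kref = `ko${nextID}`;
    nextID += 1;
    // Two DB keys per object: owner and refcount (key names are assumptions).
    store.set(`${kref}.owner`, ownerVatID);
    store.set(`${kref}.refCount`, '0,0');
    incStat('kernelObjects');
    return kref;
  }

  function deleteKernelObject(kref) {
    store.delete(`${kref}.owner`);
    store.delete(`${kref}.refCount`);
    decStat('kernelObjects'); // the decrement that was missing
  }

  return {
    addKernelObject,
    deleteKernelObject,
    getStats: () => ({ ...stats }),
    countKeys: () => store.size,
  };
}

// Regression check: the stat must track the number of live objects.
const keeper = makeKeeperSketch();
const first = keeper.addKernelObject('v1');
keeper.addKernelObject('v2');
keeper.deleteKernelObject(first);

const statsAfter = keeper.getStats();
const keysAfter = keeper.countKeys();
```

The useful property to test is that the counter and the key count agree after a create/delete cycle; a test that only checks the increment would not have caught this bug.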
Oops, we incremented this counter, but never decremented it. So our
reported `kernelObjects` value was drastically inflated, making it
look like we were leaking objects, when we probably aren't.
closes #5652