investigating mainnet slowdown around block 7709170 (05-dec-2022) #6625
Here's the output of
blocks continue to be full (about 65M computrons each) until block 7709257, where the queue is finally drained and we're able to catch up:
We'll start by examining blocks 7709165 (which recorded 240M computrons), 7709213 (250Mc), and 7709243 (280Mc). It's possible that the Big JS Hammer evaluation included a single delivery that took a very long time to complete (not yielding its time slice until well beyond the per-block limit of 65Mc), followed by a long run-out of additional work (which was properly segmented into separate blocks). And maybe this happened three times.
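To make that segmentation behavior concrete, here is a minimal sketch of a per-block computron budget. This is illustrative only, not the actual SwingSet run-policy API, and the names (`PER_BLOCK_LIMIT`, `runBlock`, `runDelivery`) are invented. The key point is that the budget is only consulted between deliveries, so a single very long delivery can push a block far past 65Mc even though the follow-on work gets split across blocks correctly.

```js
// Illustrative sketch of a per-block computron budget (invented names,
// not the real SwingSet run policy).
const PER_BLOCK_LIMIT = 65_000_000; // 65Mc

function runBlock(queue, runDelivery) {
  let used = 0;
  while (queue.length > 0 && used < PER_BLOCK_LIMIT) {
    const delivery = queue.shift();
    // Each delivery runs to completion; it cannot be interrupted mid-crank.
    const { computrons } = runDelivery(delivery);
    used += computrons;
  }
  // The total may overshoot the limit if the last delivery was large.
  return used;
}
```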
We've identified three potential issues:
Yes... each smart wallet is waiting for new vbank assets such as DAI_axl and DAI_grv: see agoric-sdk/packages/smart-wallet/src/smartWallet.js, lines 523 to 524 in 2c812d2.
Notifiers / subscriptions add a new wrinkle to the "only do work proportional to the size of the incoming message" design constraint.
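As a rough illustration (this is not the actual smartWallet.js code; the subscriber and helper names are invented), the pattern looks something like the following: every provisioned wallet consumes the same asset subscription, so publishing one new denom fans out into one purse-creation step per wallet.

```js
// Sketch only: why one new vbank asset triggers work in every wallet.
// `assetSubscriber` and `addPurseForBrand` are hypothetical names.
async function watchVbankAssets(assetSubscriber, addPurseForBrand) {
  for await (const { issuerName, brand, issuer } of assetSubscriber) {
    // Each provisioned wallet creates (and publishes) a purse for the
    // new brand, so N wallets means N of these steps per new denom.
    await addPurseForBrand(issuerName, brand, issuer);
  }
}
```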
- voting period ended 2022-12-06T18:31:37Z: proposal 17 on NG has a pretty clean view of the code that started running
- 6 new vats: 3 each for DAI_axl and DAI_grv
- 2 new purses for each of ~400 wallets: analytics.inter.trade shows over 400 smart wallets provisioned, each of which was waiting to hear about new vbank assets (agoric-sdk/packages/smart-wallet/src/smartWallet.js, lines 523 to 524 in 2c812d2)
- block times spike but eventually recover: nifty chart of block times and missing validators (ack: polkachu)
The first block was 7709165, which handled the proposal execution. It consumed 240M computrons ("240Mc") and 20.942s of swingset execution time.
This is followed by a series of blocks which each use the full 65Mc. The wallclock time they take varies widely, however: from as little as 4s, to many 20-30s blocks, a pair of 55s blocks, and a single 122s outlier.

Each block is using the expected 65M-ish computrons, but some kinds of deliveries take more wallclock time than others. We always hoped the computrons/second ratio would be roughly constant; this is what happens when it is not, and when our mix of delivery types is not random enough to even things out. So some blocks take a lot more time than others, and of course we can't just end a block after five seconds without losing determinism.

I'm still digging into the longer blocks. I think we're seeing the same "long time to serialize" problem that we observed in an earlier testnet, which we thought we fixed by updating balances less frequently. But if we're updating a whole bunch of wallet balances at the same time, we're going to be performing the same kind of serialization over and over, which exacerbates the uneven computrons/second ratio problem.

p.s. see
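A back-of-the-envelope illustration of the uneven computrons/second problem, using invented rates rather than measured ones: two blocks that both stop at the 65Mc budget can differ widely in wallclock time when the rate depends on the delivery mix.

```js
// Sketch: same computron budget, very different wallclock time.
const LIMIT = 65_000_000; // 65Mc per block

function blockSeconds(deliveries) {
  let computrons = 0;
  let seconds = 0;
  for (const { c, cPerSec } of deliveries) {
    if (computrons >= LIMIT) break;
    computrons += c;
    seconds += c / cPerSec;
  }
  return seconds;
}

// Cheap deliveries vs. expensive ones (e.g. large serializations).
// The rates here are purely illustrative placeholders.
const cheap = Array.from({ length: 65 }, () => ({ c: 1_000_000, cPerSec: 15_000_000 }));
const slow = Array.from({ length: 65 }, () => ({ c: 1_000_000, cPerSec: 2_000_000 }));
console.log(blockSeconds(cheap).toFixed(1), 's vs', blockSeconds(slow).toFixed(1), 's');
```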
BTW, the total proposal execution (which introduced two denoms) required:
We're guessing this is proportional to the number of provisioned wallets. With 410 wallets provisioned, the execution requirements for a single denom will probably be:
per provisioned wallet.
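The scaling intuition can be written down as a tiny model (all numbers below are placeholders, not measurements, and `estimateDenomCost` is an invented helper): the cost of introducing one denom is roughly a fixed overhead plus a per-wallet term, so it grows linearly with the number of provisioned wallets.

```js
// Sketch of the scaling model; figures are hypothetical placeholders.
function estimateDenomCost({ wallets, perWalletComputrons, fixedComputrons }) {
  return fixedComputrons + wallets * perWalletComputrons;
}

const total = estimateDenomCost({
  wallets: 410,
  perWalletComputrons: 1_000_000, // hypothetical per-wallet cost
  fixedComputrons: 50_000_000, // hypothetical fixed overhead (new vats, etc.)
});
console.log(`~${total / 1_000_000}Mc for one new denom`);
```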
FYI, I've confirmed a slowdown of XS workers over time, which probably contributed to this issue. This specific degradation should not occur, and we're investigating it. That said, it's not the root cause in this case; it probably just exacerbated the core problems identified above.
I recall another from our discussions:
We had a postmortem meeting today; some of the conclusions:
The slowdown was caused by a multiplication of several factors, all of which are worth investigating and fixing:
To address the per-wallet work, we need to make things lazier, and/or look for ways to just do less work per vat. To address the XS metering precision, we should sort the cranks by their computrons-per-second ratio and examine the outliers.

The breadth-first scheduling doesn't have an obvious fix. When we have a more sophisticated kernel scheduler, we might be able to do better, but the small number of vats involved means the kernel wouldn't have many criteria with which it could distinguish between work done for user 1 vs user 2, etc.

We need to replay these vats and add GC instrumentation to measure how frequently organic GC is happening, how long it takes, and whether this time was a significant contributor to the long blocks.

Anything we do to speed things up or do less work will help with problem 1 (the chain being unavailable for 35 minutes); users care more about that. Anything we can do to make execution more uniform (and make the per-block computron limit more effective) will help with problem 2 (the handful of long blocks, causing voting/consensus/stability problems); validators care more about that.
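As a starting point for the "sort the cranks" analysis, here is a sketch that reads a swingset slog and ranks deliveries by computrons per wallclock second. The slog field names used here (type, crankNum, monotime, dr[2].compute) are assumptions about the slog format and may need adjusting against the real files.

```js
// Sketch: rank cranks by computrons-per-second to find the outliers.
import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';

async function rankCranks(slogPath) {
  const starts = new Map(); // crankNum -> start monotime
  const cranks = [];
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    const entry = JSON.parse(line);
    if (entry.type === 'deliver') {
      starts.set(entry.crankNum, entry.monotime);
    } else if (entry.type === 'deliver-result') {
      const t0 = starts.get(entry.crankNum);
      if (t0 === undefined) continue;
      const seconds = entry.monotime - t0;
      // Assumed location of the metering info; adjust to the real schema.
      const computrons = entry.dr?.[2]?.compute ?? 0;
      cranks.push({ crankNum: entry.crankNum, computrons, seconds,
                    rate: computrons / seconds });
    }
  }
  // Slowest computrons/second first: these are the cranks that stretch blocks.
  return cranks.sort((a, b) => a.rate - b.rate);
}
```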
I believe @dtribble's suggestion to push new work generated by a vat onto the top of the run-queue, instead of the bottom, would have effectively prevented this breadth-first issue.
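A minimal sketch of the scheduling difference (not the kernel's actual data structures): appending newly generated work to the back of the run-queue interleaves all wallets (breadth-first), while pushing it onto the front lets each wallet's chain of work finish before the next wallet starts.

```js
// Sketch: breadth-first vs. depth-first handling of newly generated work.
function run(queue, execute, depthFirst) {
  while (queue.length > 0) {
    const item = queue.shift();
    const newWork = execute(item); // deliveries spawned by this crank
    if (depthFirst) {
      queue.unshift(...newWork); // finish this wallet's chain of work first
    } else {
      queue.push(...newWork); // current behavior: interleave all wallets
    }
  }
}
```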
Historical note: it looks like we knew about #6652 as far back as May 2022. We just didn't schedule that work for the release.
We're investigating a slowdown of the mainnet chain that happened around block 7709170, about 10:30am PST on Monday 05-dec-2022. Several validators fell behind during this time, with one block being delayed by over two minutes.
Preliminary investigation suggests that a governance proposal passed and became active at about this time, causing a large chunk of JS to be executed via the "Big JS Hammer" (core-bootstrap eval) upgrade pathway. This can be expected to take non-trivial time to complete; however, the work should have been broken up into multiple blocks. It's possible that the computron limit was not being enforced correctly, allowing the block to take longer than it should have.

We'll collect information and findings in this ticket.
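For context, a core-eval proposal generally boils down to a JS script that the bootstrap vat evaluates and invokes with a set of named powers granted by an accompanying permit. The sketch below shows only the general shape; the function name, the powers it consumes, and the body are illustrative, not taken from the proposal that actually ran in block 7709165.

```js
// Rough sketch of the shape of a core-eval ("Big JS Hammer") script.
// Power names and the body are illustrative only.
const addNewVbankAssets = async ({ consume: { bankManager, agoricNamesAdmin } }) => {
  // ...start the new issuer/mint vats, register the new denoms with the
  // bank manager, and publish the brands/issuers so wallets learn about them...
};
// The script evaluates to the function the bootstrap vat should run.
addNewVbankAssets;
```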