Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

investigating mainnet slowdown around block 7709170 (05-dec-2022) #6625

Open
warner opened this issue Dec 5, 2022 · 12 comments
Open

investigating mainnet slowdown around block 7709170 (05-dec-2022) #6625

warner opened this issue Dec 5, 2022 · 12 comments
Assignees
Labels
chain-incident performance Performance related issues vaults_triage DO NOT USE

Comments

@warner
Copy link
Member

warner commented Dec 5, 2022

We're investigating a slowdown of the mainnet chain that happend around block 7709170 , about 10:30am PST on monday 05-dec-2022. Several validators fell behind during this time, with one block being delayed by over two minutes.

Preliminary investigating suggests that a governance proposal passed and became active about this time, causing a large chunk of JS to be executed via the "Big JS Hammer" (core-bootstrap eval) upgrade pathway. This can be expected to take non-trivial time to complete, however the work should have been broken up into multiple blocks. It's possible that the computron limit was not being enforced correctly, allowing the block to take longer than it should have.

We'll collect information and findings in this ticket.

@warner
Copy link
Member Author

warner commented Dec 5, 2022

Here's the output of packages/SwingSet/misc-tools/monitor-slog-block-time.py on the blocks immediately before and after the problem:

- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709155    6( 6.5)   6.494 ->    6(  3.0)     188192  0.047( 0.071)  +   0.007 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709156    6( 6.5)   6.829 ->    0(  2.6)          0  0.003( 0.069)  +   0.015 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709157    6( 6.5)   6.730 ->    0(  2.6)          0  0.003( 0.069)  +   0.005 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709158    6( 6.5)   6.707 ->    0(  2.6)          0  0.002( 0.069)  +   0.003 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709159    6( 6.5)   6.698 ->    0(  2.4)          0  0.003( 0.067)  +   0.007 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709160    6( 6.2)   6.680 ->    6(  2.6)     175502  0.047( 0.069)  +   0.007 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709161    6( 6.2)   6.706 ->    6(  3.0)     188192  0.045( 0.071)  +   0.007 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709162    6( 6.2)   6.697 ->    0(  3.0)          0  0.003( 0.071)  +   0.003 =   0% ( 1.0) [  0/  0/  0] r0 ()
  7709163    6( 6.2)  11.264 ->    0(  2.6)          0  0.002( 0.052)  +   0.003 =   0% ( 0.9) [  0/  0/  0] r0 ()
  7709164   10( 6.2)   7.137 ->    0(  2.6)          0  0.002( 0.052)  +   0.006 =   0% ( 0.9) [  0/  0/  0] r0 ()
  7709165    6( 6.2)   7.069 ->  950( 50.1)  239910012 20.942( 1.099)  +   2.592 =  79% ( 4.9) [  0/  0/  0] r0 ()
  7709166    6( 6.2)  26.562 -> 1171(108.7)   65050750 31.495( 2.674)  +   0.021 =  94% ( 9.6) [  0/  0/  0] r0 ()
  7709167   25( 7.2)  37.407 -> 1022(159.4)   65101294 19.384( 3.641)  +   0.006 =  81% (13.6) [  0/  0/  0] r0 ()
  7709168   36( 8.7)  22.958 -> 1321(225.5)   65111031 28.176( 5.050)  +   0.003 =  92% (18.2) [  0/  0/  0] r0 ()
  7709169   22( 9.5)  35.419 -> 1235(287.2)   65027410 122.143(11.157)  +   0.033 =  95% (23.0) [  0/  0/  0] r0 ()
  7709170   34(10.9)  143.230 ->  652(319.9)   65181729 11.763(11.745)  +   0.197 =  37% (24.9) [  0/  0/  0] r0 ()
  7709171   142(17.7)  17.811 ->  654(351.1)   65342884 74.501(15.428)  +   0.046 =  94% (29.0) [  0/  0/  0] r0 ()
  7709172   17(18.2)  80.691 ->  654(383.9)   65342884 24.169(16.637)  +   0.023 =  81% (33.1) [  0/  0/  0] r0 ()
  7709173   80(21.9)  29.061 ->  684(418.1)   65276501 29.048(18.089)  +   0.006 =  87% (37.4) [  0/  0/  0] r0 ()
  7709174   28(23.0)  31.720 ->  710(453.6)   65053869 54.125(20.795)  +   0.005 =  97% (42.3) [  0/  0/  0] r0 ()
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709175   31(24.2)  66.002 ->  710(488.8)   65024668 29.351(22.260)  +   0.075 =  72% (45.9) [  0/  0/  0] r0 ()
  7709176   65(27.2)  34.031 ->  710(524.2)   65335032 55.021(25.011)  +   0.006 =  93% (50.6) [  0/  0/  0] r0 ()
  7709177   33(28.6)  66.113 ->  740(561.2)   65120336 14.889(25.756)  +   0.008 =  59% (53.5) [  0/  0/  0] r0 ()
  7709178   65(31.5)  22.919 ->  734(598.0)   65324306 20.057(26.758)  +   0.021 =  74% (57.2) [  0/  0/  0] r0 ()
  7709179   22(32.3)  25.699 ->  736(634.8)   65077924 30.291(28.273)  +   0.011 =  86% (61.6) [  0/  0/  0] r0 ()
  7709180   25(33.2)  34.337 ->  651(667.0)   65132172 25.147(29.528)  +   0.004 =  88% (65.9) [  0/  0/  0] r0 ()
  7709181   33(34.6)  33.710 ->  603(696.9)   65353411 27.246(30.888)  +   0.005 =  79% (69.8) [  0/  0/  0] r0 ()
  7709182   33(36.0)  45.166 ->  603(727.0)   65348874 25.336(32.154)  +   0.057 =  59% (72.8) [  0/  0/  0] r0 ()
  7709183   44(37.9)  38.478 ->  603(757.1)   65351449 27.391(33.524)  +   0.029 =  69% (76.3) [  0/  0/  0] r0 ()
  7709184   37(39.2)  33.069 ->  603(787.3)   65358283 29.154(34.981)  +   0.009 =  87% (80.7) [  0/  0/  0] r0 ()
  7709185   32(40.5)  36.877 ->  603(770.0)   65361309 28.131(35.341)  +   0.012 =  80% (80.7) [  0/  0/  0] r0 ()
  7709186   36(42.0)  32.898 -> 1070(764.9)   65035483 24.044(34.968)  +   0.005 =  86% (80.3) [  0/  0/  0] r0 ()
  7709187   32(42.4)  26.881 -> 1671(797.4)   65080484 32.178(35.608)  +   0.016 =  94% (81.0) [  0/  0/  0] r0 ()
  7709188   26(41.9)  39.221 -> 1671(814.9)   65080484 18.501(35.124)  +   0.050 =  75% (80.1) [  0/  0/  0] r0 ()
  7709189   38(42.6)  46.131 ->  499(778.0)   65642797 36.767(30.855)  +   0.011 =  58% (78.2) [  0/  0/  0] r0 ()
  7709190   45(43.2)  56.552 ->  485(769.7)   65598691 18.171(31.176)  +   0.005 =  49% (78.8) [  0/  0/  0] r0 ()
  7709191   55(38.9)  31.413 ->  484(761.2)   65591264 14.048(28.153)  +   0.022 =  54% (76.8) [  0/  0/  0] r0 ()
  7709192   30(39.5)  21.235 ->  484(752.7)   65587795 12.058(27.548)  +   0.006 =  67% (76.1) [  0/  0/  0] r0 ()
  7709193   20(36.5)  14.711 ->  484(742.7)   65589535 12.031(26.697)  +   0.003 =  89% (76.2) [  0/  0/  0] r0 ()
  7709194   14(35.8)  18.415 ->  484(731.4)   65590781 10.169(24.499)  +   0.059 =  64% (74.6) [  0/  0/  0] r0 ()
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709195   12(34.9)  27.489 ->  484(720.1)   65592283 20.771(24.070)  +   0.028 =  65% (74.2) [  0/  0/  0] r0 ()
  7709196   14(32.3)  45.895 ->  485(708.9)   65600037 11.029(21.870)  +   0.022 =  48% (72.0) [  0/  0/  0] r0 ()

blocks continue to be full (about 65M computrons each) until block 7709257, where the queue is finally drained and we're able to catch up:

- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709197   39(19.8)  22.831 ->  484(363.2)   65596431  9.008(12.744)  +   0.005 =  64% (60.9) [ 68/  0/ 68] r3 (BB5E2C)
  7709198   22(20.2)  15.073 ->  561(402.8)   65563866 21.843(14.564)  +   0.003 =  80% (64.9) [ 72/  0/ 72] r2 (BB5E2C)
  7709199   14(19.2)  24.778 ->  619(438.8)   65005698  8.583(13.567)  +   0.003 =  82% (67.8) [ 65/  0/ 65] r1 (BB5E2C)
  7709200   19(19.1)  16.110 ->  617(464.3)   65285790  8.785(12.884)  +   0.006 =  83% (70.0) [ 73/  0/ 73] r0 (2A3AC9)
  7709201   10(18.0)  20.463 ->  619(483.6)   65005545  8.849(12.380)  +   0.027 =  61% (68.9) [ 71/  0/ 71] r0 (CFC45F)
  7709202   15(17.7)  15.666 ->  618(498.6)   65290598  8.964(12.000)  +   0.003 =  87% (71.0) [ 71/  0/ 71] r1 (CFC45F)
  7709203   10(16.9)  15.298 ->  619(510.6)   65005683  8.846(11.685)  +   0.022 =  93% (73.2) [ 73/  0/ 73] r0 (E29E61)
  7709204   10(16.3)  20.500 ->  617(520.3)   65284344  8.575(11.402)  +   0.015 =  57% (71.7) [ 71/  0/ 71] r0 (23E1B0)
  7709205   14(16.1)  15.608 ->  619(528.5)   65006237  9.291(11.226)  +   0.003 =  94% (73.7) [ 71/  0/ 71] r1 (23E1B0)
  7709206   10(15.6)  29.852 ->  818(550.8)   65119238 25.071(12.291)  +   0.015 =  62% (72.8) [ 72/  0/ 72] r0 (47C2F8)
  7709207   15(15.6)  49.541 -> 1002(583.0)   65108717 28.765(13.468)  +   0.049 =  75% (73.0) [ 67/  0/ 67] r1 (BF9BF3)
  7709208   32(16.7)  46.817 -> 1005(611.1)   65016753 18.499(13.803)  +   0.047 =  97% (74.6) [ 65/  0/ 65] r2 (BF9BF3)
  7709209   33(17.7)  48.654 -> 1006(635.8)   65012879 35.034(15.130)  +   0.011 =  68% (74.2) [ 72/  0/ 72] r1 (996453)
  7709210   28(18.3)  56.596 ->  966(655.2)   65178791 27.960(15.885)  +   0.020 =  96% (75.5) [ 70/  0/ 70] r2 (996453)
  7709211   40(19.5)  45.159 ->  924(670.2)   65002751 10.241(15.571)  +   0.004 =  94% (76.6) [ 70/  0/ 70] r1 (BB5E2C)
  7709212   35(20.3)  21.085 ->  924(683.5)   65004071 10.268(15.292)  +   0.003 =  93% (77.5) [ 71/  0/ 71] r1 (BB5E2C)
  7709213   12(19.9)  40.445 ->  542(676.5)  250284328 44.655(16.760)  +   0.006 =  67% (77.0) [ 73/  0/ 73] r0 (9A4CCB)
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709214   15(19.9)  71.403 ->  416(697.2)   65365580  8.509(16.677)  +   0.059 =  86% (78.2) [ 71/  0/ 71] r1 (026653)
  7709215   49(21.8)  39.396 ->  415(693.8)   65358871  7.060(15.992)  +   0.003 =  45% (77.2) [ 76/  0/ 76] r1 (026653)
  7709216    9(21.6)  37.957 ->  415(690.3)   65357798  7.450(15.813)  +   0.004 =  93% (79.4) [ 72/  0/ 72] r0 (3B6773)
  7709217    8(20.0)  37.968 ->  415(686.9)   65358636 20.962(16.410)  +   0.045 =  97% (81.0) [ 76/  0/ 76] r0 (CFC45F)
  7709218    9(19.4)  52.596 ->  408(679.2)   65059232  5.153(15.576)  +   0.008 =  66% (80.3) [ 67/  0/ 67] r0 (E34C31)
  7709219   24(19.9)  42.313 ->  405(668.5)   65404941  4.542(15.374)  +   0.006 =  34% (77.9) [ 75/  0/ 75] r1 (E34C31)
  7709220    6(19.2)  41.316 ->  405(657.9)   65404691  4.611(15.165)  +   0.009 =  91% (78.3) [ 77/  0/ 77] r0 (E14F81)
  7709221    7(19.1)  43.266 ->  405(647.2)   65403452  4.622(14.954)  +   0.005 =  51% (77.8) [ 77/  0/ 77] r0 (27C9BF)
  7709222    6(18.6)  42.273 ->  405(636.5)   65404398  4.466(14.729)  +   0.005 =  92% (78.1) [ 76/  0/ 76] r0 (E29E61)
  7709223    6(18.4)  41.145 ->  940(652.6)   65072377 19.573(15.265)  +   0.003 =  97% (78.3) [ 77/  0/ 77] r0 (2A3AC9)
  7709224    6(18.2)  57.550 -> 1238(683.6)   65004024 12.381(15.456)  +   0.003 =  81% (79.5) [ 71/  0/ 71] r0 (BB5E2C)
  7709225   18(18.4)  56.115 -> 1405(723.0)   65005301 11.064(15.544)  +   0.021 =  72% (78.4) [ 70/  0/ 70] r0 (D97CBE)
  7709226   18(18.8)  52.699 ->  551(709.6)   65016601  6.953(14.638)  +   0.006 =  66% (78.6) [ 78/  0/ 78] r1 (D97CBE)
  7709227   10(18.6)  50.483 ->  336(676.3)   65316156  5.489(13.475)  +   0.017 =  86% (79.2) [ 74/  0/ 74] r0 (A21688)
  7709228    7(17.3)  49.547 ->  336(642.9)   65315525 15.434(13.321)  +   0.005 =  96% (79.2) [ 77/  0/ 77] r0 (A177B0)
  7709229   12(16.2)  55.343 ->  337(609.4)   65321205  5.984(11.869)  +   0.015 =  71% (79.3) [ 61/  0/ 61] r1 (765A78)
  7709230   19(15.8)  42.847 ->  336(577.9)   65315624  5.864(10.764)  +   0.004 =  92% (79.1) [ 69/  0/ 69] r1 (765A78)
  7709231    8(14.2)  41.466 ->  385(551.0)   65191168  4.897(10.497)  +   0.009 =  86% (78.7) [ 82/  0/ 82] r0 (BB89FD)
  7709232    6(12.8)  43.377 ->  390(524.2)   65708481  4.795(10.223)  +   0.005 =  61% (77.1) [ 79/  0/ 79] r0 (0DB33A)
  7709233    6(12.4)  55.678 ->  391(516.7)   65714431  5.134( 8.247)  +   0.029 =  27% (75.1) [ 80/  0/ 80] r0 (CFC45F)
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709234    7(12.1)  54.316 ->  391(515.5)   65713231  4.657( 8.055)  +   0.003 =  90% (75.3) [ 71/  0/ 71] r0 (5F329F)
  7709235    6( 9.9)  53.643 ->  390(514.2)   65708086  4.610( 7.932)  +   0.012 =  87% (77.3) [ 80/  1/ 81] r0 (843872)
  7709236   11(10.0)  47.731 ->  943(540.6)   65126968 10.045( 8.062)  +   0.006 =  95% (77.4) [ 67/  0/ 67] r1 (843872)
  7709237    8(10.0)  50.380 -> 1287(584.2)   65398089 22.792( 8.153)  +   0.004 =  97% (77.4) [ 69/  0/ 69] r0 (BB5E2C)
  7709238    8( 9.9)  73.646 -> 1642(645.9)   65050770 10.633( 8.427)  +   0.016 =  55% (76.9) [ 75/  0/ 75] r0 (C22CB0)
  7709239   20( 9.8)  65.480 ->  336(642.5)   65445662  6.789( 8.540)  +   0.057 =  85% (79.4) [ 73/  0/ 73] r0 (06AB35)
  7709240   15(10.2)  58.000 ->  336(639.0)   65314320  5.479( 8.583)  +   0.004 =  89% (79.3) [ 75/  0/ 75] r1 (06AB35)
  7709241    7(10.2)  57.175 ->  336(635.5)   65313635  6.510( 8.678)  +   0.030 =  90% (81.3) [ 75/  0/ 75] r0 (9E6A8A)
  7709242    6(10.2)  58.190 ->  336(632.1)   65313202  6.688( 8.789)  +   0.007 =  93% (81.3) [ 75/  0/ 75] r0 (E29E61)
  7709243    6(10.2)  61.644 ->  342(602.2)  278147375 21.960( 8.908)  +   0.005 =  88% (80.9) [ 75/  0/ 75] r0 (A9F209)
  7709244    7(10.2)  77.019 ->  386(559.6)   65570782  4.479( 8.513)  +   0.029 =  91% (81.4) [ 74/  0/ 74] r0 (2A3AC9)
  7709245   21(10.4)  71.150 ->  390(508.9)   65708256  4.443( 8.182)  +   0.023 =  29% (79.2) [ 74/  0/ 74] r1 (2A3AC9)
  7709246    6( 9.8)  70.097 ->  390(500.8)   65707574  5.156( 8.092)  +   0.016 =  91% (80.5) [ 75/  1/ 76] r0 (BCF291)
  7709247    7( 9.7)  73.149 ->  390(503.5)   65707249  4.486( 8.042)  +   0.005 =  47% (78.5) [ 76/  0/ 76] r0 (CFC45F)
  7709248   11( 9.8)  67.156 ->  391(506.2)   65714344  4.719( 7.506)  +   0.005 =  90% (78.2) [ 79/  0/ 79] r1 (CFC45F)
  7709249    6( 9.6)  78.223 -> 1131(546.0)   65078323 10.731( 7.743)  +   0.003 =  46% (77.0) [ 78/  0/ 78] r0 (BB5E2C)
  7709250    6( 8.9)  86.226 -> 1339(596.1)   65038299 11.131( 8.007)  +   0.003 =  77% (76.2) [ 70/  0/ 70] r0 (8BA38C)
  7709251   11( 9.1)  87.041 -> 1554(654.5)   65029239 22.834( 8.904)  +   0.005 =  97% (76.7) [ 73/  0/ 73] r0 (0DB33A)
  7709252   10( 9.2)  100.462 -> 1399(705.0)   65051279 12.786( 9.303)  +   0.034 =  95% (78.4) [ 73/  0/ 73] r0 (0FED4E)
  7709253   22(10.1)  91.695 -> 1114(741.1)   65069571 20.106(10.052)  +   0.017 =  97% (82.0) [ 73/  0/ 73] r1 (DC4CCA)
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709254   16(10.5)  96.330 -> 1409(792.0)   65086102 11.202(10.379)  +   0.017 =  95% (82.2) [ 73/  0/ 73] r1 (1CE894)
  7709255   21(11.2)  87.210 -> 1485(846.8)   65039781 12.390(10.768)  +   0.027 =  94% (82.6) [ 74/  0/ 74] r1 (1CE894)
  7709256    9(11.2)  91.243 ->  941(846.7)   61174777 22.553(11.393)  +   0.003 =  97% (82.7) [ 74/  0/ 74] r0 (4EBC59)
  7709257   11(11.3)  103.351 ->    6(782.6)     219917  0.040(10.256)  +   0.025 =  10% (78.3) [ 73/  0/ 73] r0 (E29E61)
  7709258   39(12.8)  64.821 ->    0(700.5)          0  0.002( 9.724)  +   0.005 =   1% (75.6) [ 80/  2/ 82] r3 (32B4F1)
  7709259    6(12.2)  59.245 ->    0(683.8)          0  0.002( 9.385)  +   0.005 =   1% (71.4) [ 82/  0/ 82] r0 (E29E61)
  7709260   11(11.9)  48.725 ->    6(667.2)     188192  0.037( 9.113)  +   0.008 =   8% (67.4) [ 82/  0/ 82] r1 (E29E61)
  7709261    6(11.9)  43.447 ->    0(650.5)          0  0.002( 8.787)  +   0.003 =   0% (62.9) [ 82/  1/ 83] r0 (BB5E2C)
  7709262   11(12.2)  32.892 ->    0(633.6)          0  0.002( 8.453)  +   0.003 =   1% (58.3) [ 83/  1/ 84] r1 (BB5E2C)
  7709263    6(12.2)  27.301 ->    0(616.5)          0  0.002( 7.355)  +   0.009 =   2% (54.0) [ 84/  0/ 84] r0 (CFC45F)
  7709264    6(12.1)  21.721 ->    0(597.2)          0  0.002( 7.131)  +   0.003 =   1% (49.5) [ 85/  0/ 85] r0 (843872)
  7709265    6(11.3)  16.118 ->    6(578.0)     175502  0.034( 6.911)  +   0.015 =  11% (48.6) [ 85/  0/ 85] r0 (A21688)
  7709266    6(11.3)  10.625 ->    0(558.5)          0  0.002( 6.653)  +   0.003 =   1% (44.1) [ 85/  1/ 86] r0 (026653)
  7709267    7(11.3)  11.318 ->    6(539.4)     194537  0.337( 6.446)  +   0.019 =   4% (41.9) [ 86/  0/ 86] r0 (2A3AC9)
  7709268   10(11.3)   7.278 ->    6(520.1)     175502  0.042( 6.212)  +   0.004 =   0% (37.4) [ 86/  0/ 86] r1 (2A3AC9)
  7709269    6(11.3)  11.994 ->    6(463.9)     188192  0.043( 5.678)  +   0.023 =   0% (35.1) [ 87/  0/ 87] r0 (0CCD5C)
  7709270   11(11.6)   6.819 ->    6(397.2)     175502  0.040( 5.123)  +   0.007 =   0% (31.3) [ 87/  0/ 87] r1 (0CCD5C)
  7709271    6(11.3)   6.994 ->    6(319.8)     175502  0.043( 3.983)  +   0.005 =   0% (26.5) [ 87/  0/ 87] r0 (0DB33A)
  7709272    6(11.1)   7.102 ->    0(249.8)          0  0.003( 3.344)  +   0.004 =   0% (21.7) [ 87/  0/ 87] r0 (5F329F)
  7709273    6(10.3)   6.982 ->    0(194.2)          0  0.003( 2.339)  +   0.006 =   0% (16.8) [ 87/  1/ 88] r0 (23E1B0)
- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709274    6( 9.8)   7.113 ->    0(123.7)          0  0.006( 1.779)  +   0.003 =   0% (12.0) [ 88/  1/ 89] r0 (BB5E2C)
  7709275    6( 9.1)   7.395 ->    0( 49.5)          0  0.002( 1.160)  +   0.003 =   0% ( 7.3) [ 89/  0/ 89] r0 (3B4946)
  7709276    6( 8.9)   7.408 ->    6(  2.7)     181847  0.045( 0.035)  +   0.006 =   0% ( 2.5) [ 89/  0/ 89] r0 (A177B0)
  7709277    6( 8.7)  12.279 ->    6(  2.7)     175502  0.076( 0.036)  +   0.023 =   0% ( 2.0) [ 89/  0/ 89] r0 (CFC45F)

We'll start by examining block 7709165 (which recorded 240M computrons), and 7709213 (250Mc), and 7709243 (280Mc). It's possible that the Big JS Hammer evaluation included a single delivery that took a very long time to complete (not yielding it's time slice until well beyond the per-block limit of 65Mc), followed by a long run-out of additional work (which was properly segmented into separate blocks). And maybe this happened three times.

@warner warner self-assigned this Dec 5, 2022
@warner
Copy link
Member Author

warner commented Dec 5, 2022

We've identified three potential issues:

  • 1: the core-bootstrap upgrade message provoked the creation of 6 vats, which might be more than it was supposed to create (p.s. this is by design)
  • 2: the runPolicy (which limits the amount of work we do in each block, by ending the block after a "bean counter" has grown beyind the configured limit) is apparently not counting the create-vat / startVat events properly (cosmic-swingset runPolicydoes not pay attention to startVat computron usage #6639)
    • each swingset computron counts as 100 beans
    • the per-block limit is 6.5B beans (which could be filled by 65M computrons)
      • run agd query swingset params and look for beans_per_unit.blockComputeLimit)
    • apparently each new vat counts for 30M beans (beans_per_unit.vatCreation)
    • but, the actual delivery took about 92M computrons
    • (we only started revealing this number about 6-9 months ago, when we introduced the startVat delivery message)
    • what we want to happen is that the first startVat spends its 92Mc, then the runPolicy says "ok time to end that block", then the second startVat runs in the second block, then all the remaining work happens in subsequent blocks
    • that would have spread the work out into smaller pieces, avoiding the worst of the delays
  • 3: I'm seeing a huge (hundreds) of notifies/deliveries after the new vats are created, all of which talk about Purses and balances of the new denominations (introducing new vbank assets results in O(n) work to update smart wallets #6652)
    • I'm worried that we're somehow iterating through all registered/provisioned vat-wallet accounts and notifying everybody that they have new zero-balance Purses available

@dckc
Copy link
Member

dckc commented Dec 5, 2022

  • I'm worried that we're somehow iterating through all registered/provisioned vat-wallet accounts and notifying everybody that they have new zero-balance Purses available

Yes... each smart wallet is waiting for new vbank assets such as DAI_axl and DAI_grv:

// watch the bank for new issuers to make purses out of
void observeIteration(E(bank).getAssetSubscription(), {

notifiers / subscriptions adds a new wrinkle to the "only do work proportional to the size of the incoming message" design constraint.

cc @michaelfig @turadg @erights

@dckc
Copy link
Member

dckc commented Dec 6, 2022

voting period ended 2022-12-06T18:31:37Z

proposal 17 on NG has a pretty clean view of the code that started running.
Or try agd --node=https://main.rpc.agoric.net:443 query gov proposal 17.

6 new vats: 3 ea. for DAI_axl, DAI_grv

  1. startPSM.makeAnchorAsset(noMinting, { options: { anchorOptions } }) starts a mintHolder
  2. startPSM.startPSM(permittedPowers, config) starts a contractGovernor
  3. " starts a psm

2 new purses for each of ~400 wallets

analytics.inter.trade shows over 400 smart wallets provisioned. Each of them was waiting to hear about new vbank assets:

// watch the bank for new issuers to make purses out of
void observeIteration(E(bank).getAssetSubscription(), {

block times spike but eventually recover

nifty chart of block times and missing validators (ack: polkachu)

image

@warner
Copy link
Member Author

warner commented Dec 6, 2022

The Munin graphs on a follower node show the state-size growth when the proposal was executed:

6 new XS heap snapshots, one per new vat:

agd_swingset_xs_snapshots-day

but the new vats are small, so the total size consumed by heap snapshots only grew by about 20MB (compressed, stored as plain files):

agd_swingset_xs_size-day

The initial/startup transcripts for those new vats consumed about 150MB (uncompressed, stored in SQLite) (note: the transcript space normally grows by about 40MB per day)

agd_swingset_streamstore_size-day

The kvStore size grew by 12MB:
agd_swingset_kvstore_size-day

(for comparison, on a node that prunes rather aggressively, the cosmos-side state directory normally grows by about 780MB/day)

@warner
Copy link
Member Author

warner commented Dec 6, 2022

The blocks that included a startVat consumed a lot of computrons (which the runPolicy didn't correctly account for), but they didn't actually require a lot of runtime.

The first block was 7709165, which handled the proposal execution. It consumed 240M computrons ("240Mc"), and 20.942s of swingset execution time. The two startVats inside it each took 88Mc and 1.8s. The runPolicy bug meant the startVats only counted 300kc towards the block's 65Mc limit, so the remaining 64.4Mc were spent on other deliveries (which required 17.34s). That's not great but not too bad, leading to a block time of maybe 25s-27s (which, because of the way blockTime is assigned to a later block, appears in my output two blocks later, as the 7709167 25 line)

- block  blockTime    lag  -> cranks(avg) computrons  swingset(avg)  +  cosmos = proc% (avg) [sigs2/3/tot] rounds proposer
  7709165    6( 6.2)   7.069 ->  950( 50.1)  239910012 20.942( 1.099)  +   2.592 =  79% ( 4.9) [  0/  0/  0] r0 ()

  7709166    6( 6.2)  26.562 -> 1171(108.7)   65050750 31.495( 2.674)  +   0.021 =  94% ( 9.6) [  0/  0/  0] r0 ()
  7709167   25( 7.2)  37.407 -> 1022(159.4)   65101294 19.384( 3.641)  +   0.006 =  81% (13.6) [  0/  0/  0] r0 ()

This is followed by a series of blocks which each use the full 65Mc. The amount of wallclock time they take varies, however, from as little as 4s, to a lot of 20-30s blocks, a pair of 55s, and a single 122s outlier.

  7709168   36( 8.7)  22.958 -> 1321(225.5)   65111031 28.176( 5.050)  +   0.003 =  92% (18.2) [  0/  0/  0] r0 ()
  7709169   22( 9.5)  35.419 -> 1235(287.2)   65027410 122.143(11.157)  +   0.033 =  95% (23.0) [  0/  0/  0] r0 ()

Each block is using the expected 65M-ish computrons, but some kinds of deliveries take more wallclock time than others (we always hoped the computrons/second ratio would be roughly constant; this is what happens when it is not, and when our mix of delivery types is not random enough to even things out). So some blocks take a lot more time than others, and of course we can't just end a block after five seconds without losing determinism.

I'm still digging into the longer blocks. I think we're seeing the same "long time to serialize" problem that we observed in an earlier testnet, which we thought we fixed by updating balances less frequently. But if we're updating a whole bunch of wallet balances at the same time, we're going to be performing the same kind of serialization in a row, which exacerbates the uneven computrons/second ratio problem.

p.s. see

@warner
Copy link
Member Author

warner commented Dec 6, 2022

BTW, the total proposal execution (which introduced two denoms) required:

  • 65_258 cranks (both normal delivery cranks, and the trivial routing cranks, probably 50/50)
  • 95 blocks
  • 6.5Gc (6_574_772_805 computrons)
  • 1642s (27.5 min) of swingset time
  • 2105s (35.1 min) of wallclock time

We're guessing this is proportional to (number of denoms introduced) x (number of provisioned wallet accounts). So one (slight) mitigation is to only add one denom at a time, which would give the chain more time to recover before taking on the second half.

With 410 wallets provisioned, the execution requirements for a single denom will probably be:

  • 80 cranks
  • 0.1 blocks
  • 8Mc
  • 2.0s swingset time
  • 2.5s wallclock

per provisioned wallet.

@mhofman
Copy link
Member

mhofman commented Dec 9, 2022

FYI, I've confirmed a slowdown of XS workers over time, which probably contributed to this issue. This specific degradation should not occur and we're investigating it. That said it's not a cause in this case, and probably just exacerbated the core problems identified above.

@dckc
Copy link
Member

dckc commented Dec 12, 2022

We've identified three potential issues:

I recall another from our discussions:

  1. The per-block limit of 65Mc was designed to represent 30 seconds of wall-clock time, but in one case it lasted 122.143 sec, which caused something of a storm in the consensus protocol.
    • While it's somewhat unsurprising that computrons don't faithfully represent wall-clock time, the hope that some random variation would smoothing things out didn't pan out.
    • Do we have significant gaps in our cost accounting? (chainStorage writes come to mind)

@otoole-brendan otoole-brendan added the vaults_triage DO NOT USE label Dec 20, 2022
@warner
Copy link
Member Author

warner commented Jan 13, 2023

We had a postmortem meeting today, some of the conclusions:

  • there were two somewhat-distinct problems:
    • 1: large amount of work triggered by surprisingly small action: this made the chain unavailable for new work for 35-ish minutes
    • 2: excessive block times in a few early blocks: this caused some validators to fall behind, and made it look like the chain had halted entirely for a little while

The slowdown was caused by a multiplication of several factors, all of which are worth investigating and fixing:

  • doing anything per-wallet caused a lot more work to be executed than we'd expected: each wallet did operations A/B/C/D/E, times a fan-out of 400x (introducing new vbank assets results in O(n) work to update smart wallets #6652)
  • XS metering is not precise: some "big" computrons take more wallclock time than other "small" ones: computron limit failed to keep swingset compute time in wall-clock bounds #6857
    • this means the 65Mc/block limit doesn't always meet the 5-10s of computation goal: sometimes it ends the block too early, sometimes too late
  • the breadth-first scheduling of that work (400x "A" steps, followed by 400x "B" steps, etc) causes a lot of identical operations to be done in a big batch, exacerbating the big-vs-small computron problem
    • if we did one A/B/C/D/E, then the second A/B/C/D/E, then the third, etc, we'd get more statistical averaging of big/small computrons, making the 65Mc/block limit more meaningful
  • there might have been additional slowdowns due to organic GC time, we're not sure
    • if so, this would probably be a consequence of the vats being "large" (lots of objects), which is a consequence of insufficient use of virtual objects on high-cardinality data, like wallets
  • there were certainly slowdowns due to the XS "dusty worker degradation" problem (Slowdown of XS worker #6661), which doesn't increase the number of computrons consumed, but does increase the wallclock required, and most validators had not rebooted recently
    • this could also exacerbate the difference between validators: recently-rebooted "clean" ones might have gotten through those large blocks more quickly than non-recently-rebooted "dusty" ones

To address the per-wallet work, we need to make things more lazy, and/or look for ways to just do less work per vat.

To address the XS metering precision, we should sort the cranks by their wallclock / computrons ratio, then look carefully at the biggest vs the smallest, and try to figure out which low-level JS operations were involved in each. Perhaps we could run a lot of the same-sized deliveries in a loop, while we point a profiler at the worker, and build up a flamegraph of which XS .c functions get called during each. If we're lucky there may be a few specific XS functions that account for lots of wallclock but not a lot of computrons, and we can change their metering behavior to increase the computron consumption. Likewise there might be code that increments the meter a lot but actually runs very quickly, which could use adjusting in the opposite direction.

The breadth-first scheduling doesn't have an obvious fix. When we have a more sophisticated kernel scheduler, we might be able to do better, but the small number of vats involved means the kernel wouldn't have a lot of criteria with which it could distinguish between work done for user 1 vs user 2, etc.

We need to replay these vats and add GC instrumentation measure how frequently organic GC is happening, how long it takes, and whether this time was a significant contributor to the wallclock/computron variance. We might address this through XS changes, but my hunch is that finishing virtualization of the vats involved will make the most difference, to shrink the JS heaps and keep GC fast.

Anything we do to speed things up or do less work will help with problem 1 (the chain being unavailable for 35 minutes). Users care more about that.

Anything we can do to make execution more uniform (and make the per-block computron limit more effective) will help with problem 2 (the handful of long blocks, causing voting/consensus/stability problems). Validators care more about that.

@mhofman
Copy link
Member

mhofman commented Jan 13, 2023

The breadth-first scheduling doesn't have an obvious fix. When we have a more sophisticated kernel scheduler, we might be able to do better, but the small number of vats involved means the kernel wouldn't have a lot of criteria with which it could distinguish between work done for user 1 vs user 2, etc.

I believe @dtribble's suggestion to push new work generated by a vat on the top of the runqueue instead of the bottom would have effectively prevented this breadth-first issue.

@dckc dckc added the performance Performance related issues label Jan 13, 2023
@dckc
Copy link
Member

dckc commented Jan 23, 2023

historical note: it looks like we knew about #6652 as far back as May 2022

We just didn't schedule that work for the release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chain-incident performance Performance related issues vaults_triage DO NOT USE
Projects
None yet
Development

No branches or pull requests

4 participants